Modernizing Data Engineering through Modular DataInfraOps

By Sandeep Madamanchi

Elevator Pitch

Data often determines the winners and losers in an industry. Data engineers develop data architectures by loosely sewing together services with a focus on operational excellence in every stitch. The selection of modular cloud services has never been more directly correlated with business success.

Description

Data is omnipresent. Using data to effectively drive business decisions is the ultimate goal of analytical groups, but clean and accurate data are the critical prerequisites. Data engineers aim to solve for those prerequisites by leveraging well-established patterns and modern-day software components to provide a centralized repository of structured, unstructured, and semi-structured data sources: enter the data lake.

Building and hydrating a scalable, highly available, resilient data lake is what all data engineering organizations aspire to. The height of that hurdle depends on the variety of data categories, the expected velocity of the ingestion processes, and the sheer volume of the data sets in question. When enough data comes into play, we call it Big Data, and traditional, vertically scalable processes no longer suffice. Over the years, data engineering organizations have shifted their toolsets to take advantage of open source projects such as Hadoop, Hive, Spark, Flink, and Presto, which are uniquely suited to distributing large, compute-intensive workloads across commodity hardware.

Today, cloud providers offer managed (often serverless) services built on the shoulders of these open source projects to address problem areas such as streaming high-velocity clickstream data or running relatively static batch processes that routinely handle gigabytes or terabytes of data. A combination of these services can solve the majority of use cases. But despite the maturity of these tools, challenges persist, and cloud technologies and patterns continue to evolve and open up new avenues. As a result, it is a best practice to minimize coupling between your components today so your architectures can continue to evolve with the tools of tomorrow.
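To make that decoupling concrete, here is a minimal sketch in plain Python, with illustrative names that do not come from any particular library: pipeline logic written against narrow Source and Sink interfaces, so a queue-backed source can be swapped for a stream-backed one, or one provider's object store for another's, without touching the pipeline itself.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class Source(ABC):
    """Anything that can yield raw records: a queue, a stream, an API."""

    @abstractmethod
    def read(self) -> Iterable[dict]:
        ...


class Sink(ABC):
    """Anything that can persist records: object storage, a warehouse."""

    @abstractmethod
    def write(self, records: Iterable[dict]) -> None:
        ...


def run_pipeline(source: Source, sink: Sink) -> None:
    # The pipeline depends only on the interfaces above, so swapping a
    # Kinesis-backed Source for a Kafka-backed one (or an S3 Sink for a
    # GCS one) requires no change here.
    sink.write(source.read())
```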

As requirements and data landscapes change, the most drastic circumstances may even justify a cloud provider migration, when the benefits of one provider outweigh the offerings of another. An effective strategy for making such a switch feasible is a modular approach built on pluggable, compartmentalized infrastructure. The impact of any change can be measured by defining and capturing key operational metrics. To make these sorts of adaptations tenable, both the infrastructure changes and the establishment of those metrics can be handled through infrastructure as code (IaC).
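As an illustration of what one such pluggable module can look like, below is a minimal sketch using Pulumi's Python SDK against AWS; the resource names and tags are hypothetical, and the same boundary could equally be expressed in Terraform or CloudFormation.

```python
import pulumi
import pulumi_aws as aws

# Hypothetical "raw zone" storage module. Other stacks reference the
# bucket only through the exported output below, never a hard-coded
# ARN, which keeps the module compartmentalized and swappable.
raw_zone = aws.s3.Bucket(
    "raw-zone",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"layer": "raw", "managed-by": "iac"},
)

pulumi.export("raw_zone_bucket", raw_zone.id)
```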

In addition to leveraging the benefits of IaC, teams should characterize their processes and define a finite set of reusable, cloud-native patterns that take full advantage of modern cloud offerings. The selection, provisioning, and deployment of these services become an infrastructure-oriented concern, and their management becomes an operational task of its own. This shifts the conversation from writing code that handles specific tasks to constructing parameterized workflows, significantly reducing lines of code and resulting in a more holistic approach.
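One way to picture such a parameterized workflow is the sketch below: a single generic ingestion entry point driven entirely by configuration, so onboarding a new feed means declaring parameters rather than writing another script. The field names here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    """Everything feed-specific lives in configuration, not in code."""
    source_uri: str
    target_path: str
    file_format: str = "parquet"
    partition_keys: tuple[str, ...] = ()


def ingest(config: IngestionConfig) -> None:
    # One generic job for every feed; the per-feed logic collapses
    # into the parameters passed in.
    print(f"Loading {config.source_uri} -> {config.target_path} "
          f"as {config.file_format}, partitioned by {config.partition_keys}")


# Onboarding a new data set is now a declaration, not a new script.
ingest(IngestionConfig(
    source_uri="s3://vendor-drop/orders/",
    target_path="s3://lake/raw/orders/",
    partition_keys=("ingest_date",),
))
```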

Data engineering is the synergy of software engineering, infrastructure provisioning, and operational observability. Data engineers must still consider concerns such as compression, optimized columnar formats, and data partitioning. But the biggest rewards are reaped from a modular, well-selected combination of native cloud services that map well to your data's requirements and your business's needs. This opens opportunities for modern infrastructure teams (DevOps/SRE) to become an unconventional part of the data engineering process and of the development of sustainable, data-driven organizations. Welcome, DataInfraOps.
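For instance, the low-level concerns mentioned above (compression, columnar formats, and partitioning) can all be addressed in a few lines with PyArrow; the column names and output path in this sketch are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table standing in for the output of an ingestion job.
table = pa.table({
    "ingest_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "event": ["click", "view", "click"],
})

# Columnar format, compression, and partitioning in one call: Parquet
# files land under one directory per ingest_date value.
pq.write_to_dataset(
    table,
    root_path="lake/raw/events",
    partition_cols=["ingest_date"],
    compression="snappy",
)
```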

Notes

We have built a highly sustainable and scalable data lake using modern techniques of treating infrastructure as code. We currently ingest data from on-prem systems over VPN, from external cloud vendors through queues, streams, topics, and API scraping, and from databases through change data capture (CDC). We achieved this by choosing the cloud service that best suits each use case and implementing the infrastructure for that service through IaC, which makes the process repeatable and reliable. We also considered the performance characteristics a data lake requires, such as using Parquet as our storage format. The data lake is now a repository of dependable data that machine learning scientists can use for predictive analysis and engineering teams can use for operational goals.
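As a hedged example of how the operational side of such a lake can be instrumented, an ingestion run might publish a custom metric via boto3 as sketched below; the namespace, metric, and dimension names are illustrative, not the ones we actually use.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical metric emitted at the end of an ingestion run, giving
# operators a concrete per-feed signal to dashboard and alarm on.
cloudwatch.put_metric_data(
    Namespace="DataLake/Ingestion",
    MetricData=[{
        "MetricName": "RecordsIngested",
        "Dimensions": [{"Name": "Feed", "Value": "orders"}],
        "Value": 12345.0,
        "Unit": "Count",
    }],
)
```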