What is Data Engineering, and What Does a Data Engineer Do?

Data engineering is very popular these days, which raises the question - what is data engineering? And what does a data engineer do? We’re often asked this question both by people interested in becoming data engineers and by those who want to hire them.

Though data engineering has existed in some form for as long as companies have worked with data (BI engineer, ETL developer, etc.), it came into sharp focus alongside the rise of data science in the 2010s. Data engineering is now arguably as in-demand as data science, as companies realize that building a solid data foundation is critical for data science success.

But what is data engineering, exactly? Let’s start off with some real talk - there are endless definitions of data engineering. A Google exact-match search for “what is data engineering?” returns over 91,000 unique results. A cursory look at the various definitions yields a variety of answers, most of which are either out of date or specific to particular technologies. Let’s step back and define data engineering in a general context.

We define data engineering as the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of data management, DataOps, data architecture, orchestration, and software engineering.

Put another way, data engineering produces reliable data that serves the business with predictable quality and meaning, built atop systems supporting this data. Data engineering systems and outputs are the backbone of successful data analytics and data science.

Now that we’ve defined data engineering, you’re probably wondering what a data engineer does. A data engineer applies the discipline of data engineering, taking raw data and making it useful to downstream users, such as data scientists, analysts, and AI/ML engineers. A data engineer manages the data engineering lifecycle (to be covered in an upcoming article), from source systems to serving data for downstream use cases, such as analysis or machine learning.
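To make “taking raw data and making it useful” a bit more concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration of ours rather than a prescribed method: the `Order` record, the `parse_order` and `clean_orders` helpers, and the sample rows are all made up. It simply shows the kind of work involved: normalizing inconsistent values, dropping unusable records, and removing duplicates so downstream users get consistent data.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Order:
    """A clean, typed record that downstream users can rely on (hypothetical schema)."""
    order_id: int
    customer: str
    amount_usd: float
    order_date: date


def parse_order(raw: dict) -> Optional[Order]:
    """Turn one raw record into a clean Order, or return None if it is unusable."""
    try:
        return Order(
            order_id=int(raw["id"]),
            # Normalize whitespace and casing so "Alice" and " alice " match.
            customer=raw["customer"].strip().lower(),
            # Accept "$19.99", "19.99", or a bare number.
            amount_usd=round(float(str(raw["amount"]).replace("$", "")), 2),
            order_date=datetime.strptime(raw["date"], "%Y-%m-%d").date(),
        )
    except (KeyError, ValueError, TypeError, AttributeError):
        # Missing fields or unparseable values: reject the record.
        return None


def clean_orders(raw_rows: Iterable[dict]) -> List[Order]:
    """Parse raw rows, skipping bad records and keeping the first row per order_id."""
    seen = set()
    clean = []
    for raw in raw_rows:
        order = parse_order(raw)
        if order is None or order.order_id in seen:
            continue
        seen.add(order.order_id)
        clean.append(order)
    return clean


# Made-up sample input with the usual problems found in raw data.
RAW_ROWS = [
    {"id": "1", "customer": " Alice ", "amount": "$19.99", "date": "2021-07-01"},
    {"id": "1", "customer": "alice", "amount": "19.99", "date": "2021-07-01"},  # duplicate id
    {"id": "2", "customer": "Bob", "amount": "n/a", "date": "2021-07-02"},  # bad amount
    {"id": "3", "customer": "Carol", "amount": 5, "date": "2021-07-03"},
]

CLEAN = clean_orders(RAW_ROWS)

if __name__ == "__main__":
    for order in CLEAN:
        print(order)
```

In a real system this logic would live inside a pipeline with orchestration, monitoring, and storage around it, but the core idea is the same: raw data goes in, and reliable, predictable data comes out.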

A big question that comes up: what technologies should a data engineer know? We think this is increasingly the wrong question to ask.

In the recent past, a data engineer was expected to know and understand how to use a small handful of powerful and monolithic technologies to create a data solution. Utilizing these technologies (Hadoop, Spark, Teradata, and many others) often required a sophisticated understanding of software engineering, networking, distributed computing, storage, or other low-level details. A data engineer’s work would be devoted to cluster administration and maintenance, managing overhead, and writing pipeline and transformation jobs, among other highly technical tasks.

Nowadays, the data tooling landscape is dramatically less complicated to manage and deploy. Recent tools greatly abstract and simplify workflows. As a result, the role of a data engineer is evolving from being an expert in a handful of complicated technologies toward balancing the simplest and most cost-effective best-of-breed services that deliver value to the business. The data engineer is also expected to create agile data architectures that evolve as new trends emerge.

Especially in today’s data ecosystem, where tools, technologies, and practices are evolving at a blinding rate, it’s easy to get distracted by shiny objects and think that data engineering is the equivalent of using the latest “technology X” (X = Databricks, Snowflake, dbt, etc.). Data engineering is not - and never has been - about any particular technology. Data engineering is about designing, building, and maintaining systems that incorporate best-of-breed technologies and practices in an agile and cost-effective way.

What are some things a data engineer does NOT do? A data engineer typically does not directly build machine learning models, create reports or dashboards, perform data analysis, build KPIs, or develop software applications. That said, a data engineer should have a good foundational understanding of all of these areas in order to consume data from, provide data to, and collaborate effectively with stakeholders in these domains.

Now that you have an understanding of what data engineering is and what a data engineer does, hopefully you have better context for how this role fits into your situation. If you’re interested in becoming a data engineer, this is the perfect time to enter the field: demand is huge, and job opportunities are plentiful. If you’re looking for data engineers, you now have a better idea of what to look for. (Granted, the available pool of good candidates is pretty small right now.)

Stay tuned for more articles on data engineering, including the data engineering lifecycle, the undercurrents of data engineering, and much more.

Joseph Reis, July 14, 2021