What is Data Engineering?

The essence

At its core, data engineering is the process of ingesting, storing and transforming data into a format that is suitable for analysis. This can include cleaning and organizing the data, as well as applying transformations to make it more usable. The goal is to make the data ready for use in analytical applications, reporting and other purposes, such as Machine Learning.
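To make the ingest-and-transform idea concrete, here is a minimal sketch in Python. The records, field names and cleaning rules are all made up for illustration: raw rows are cleaned (text normalized, dates parsed, incomplete rows dropped) so they are ready for analysis.

```python
from datetime import date

# Hypothetical raw records as they might arrive from a source system.
raw_rows = [
    {"user": "  alice ", "signup": "2023-04-01", "plan": "pro"},
    {"user": "bob", "signup": "2023-04-02", "plan": ""},  # incomplete
    {"user": "carol", "signup": "2023-04-03", "plan": "free"},
]

def transform(rows):
    """Clean raw rows into an analysis-ready shape."""
    cleaned = []
    for row in rows:
        if not all(row.values()):  # drop records with missing fields
            continue
        cleaned.append({
            "user": row["user"].strip(),               # normalize text
            "signup": date.fromisoformat(row["signup"]),  # parse dates
            "plan": row["plan"],
        })
    return cleaned

print(transform(raw_rows))
```

Real pipelines do the same kind of work at scale, usually with libraries or frameworks rather than hand-rolled loops, but the shape of the step is the same: raw in, clean and typed out.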

The difference between operational systems and analytical systems

The databases that power web applications are often not suited for analytical purposes. Their compute power is limited, and heavy analytical queries interfere with what they were originally set up for: transactional workloads. These databases are optimized for reading and writing individual rows of data very quickly. Typical operational databases include MySQL, PostgreSQL and Oracle Database.

A common practice for Data Engineers is extracting the data from those systems and offloading it to analytical databases that are designed to scale. These databases can handle large queries and return large result sets faster than operational databases can. They are often referred to as Data Warehouses or Data Lakes.
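The extract-and-offload step can be sketched as follows. This example uses two in-memory SQLite databases as stand-ins for an operational database and an analytical one, purely so it runs self-contained; in practice the source would be something like PostgreSQL and the target something like Snowflake or BigQuery, and the table and column names here are invented.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the operational DB
target = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Seed the "operational" side with a few transactional rows.
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 9.99), (2, 24.50), (3, 5.00)])

# Extract all rows from the source...
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# ...and load them into the analytical copy, where big aggregations
# can run without touching the operational system.
target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
target.executemany("INSERT INTO orders VALUES (?, ?)", rows)

total = target.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 39.49
```

Production pipelines add incremental extraction, scheduling and error handling on top, but the core movement of data from an operational store into an analytical one is exactly this.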

Examples of typical cloud analytical databases are Databricks, Snowflake, Redshift, BigQuery or Synapse Analytics.

Data Engineers know their way around these analytical databases. They typically know how to set up the necessary tools, ingest data, create tables and optimize query performance.

While analytical databases are a very important component of the analytics stack, and their vendors keep shipping more features out of the box to support the full lifecycle, more is needed than just storing and querying data.

The Data Engineering lifecycle

In their book Fundamentals of Data Engineering, Joe Reis and Matt Housley describe the scope of Data Engineering in terms of four main components: Ingestion, Transformation, Serving and Storage. Their diagram below shows these components and how they interact.

Taken from the book Fundamentals of Data Engineering (2022) by Joe Reis & Matt Housley

However, Data Engineering is more than just moving a dataset from A to B. In the same diagram, Housley and Reis propose six undercurrents, processes that support the whole data engineering lifecycle: Security, Data Management, DataOps, Data Architecture, Orchestration and Software Engineering.

Programming languages

Data Engineers are generally proficient in one or more object-oriented programming languages, with Python topping the charts. Its versatility and ease of use have made it the de-facto language for most programmatic data pipelines. Since the Data Science ecosystem is also heavily reliant on Python, it is relatively easy for Data Scientists to transition into the Data Engineering field.

A common alternative to Python in Data Engineering is Scala. This robust language is chosen less often, as the barrier to entry is higher. Some engineers dislike its verbosity and strict syntax rules, while others prefer it for those very same reasons. If you are a beginning programmer, I would advise sticking to Python.

Since Data Engineers work with structured data, often stored in databases, SQL is another important language in their toolset. With its easy-to-read syntax and long history, SQL is one of the main languages for querying data.
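Here is a small example of the kind of aggregation query that comes up constantly in analytical work, run against an in-memory SQLite table so it is self-contained; the table name, columns and values are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("EU", 100.0), ("EU", 50.0), ("US", 75.0),
])

# GROUP BY rolls individual rows up into one total per region,
# the bread-and-butter pattern of analytical SQL.
for region, total in db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
# EU 150.0
# US 75.0
```

The same query runs, with minor dialect differences, on any of the analytical databases mentioned above.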
