What are ETL and ELT?
ETL and ELT are two approaches to data processing. Both describe how to take data from a source system and store a transformed version of it, ready for consumption by end users, whether that be marketeers, data scientists or analysts.
ETL stands for “Extract, Transform, Load” and is the process of extracting data from a source, performing some kind of data processing via a separate server to transform the data into a usable state, and then loading it to some final destination (usually a data warehouse). ETL has been around since the inception of the data warehouse itself.
There’s a clear distinction here between the tools used to transform the data, and where the final data is being stored.
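As a rough sketch of what that looks like in practice, here is a small ETL-style job in Python. The API endpoint, credentials and table names are purely illustrative, and a real pipeline would add error handling and scheduling; the point is simply that the transformation happens outside the warehouse, before the load.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract: pull raw records from a (hypothetical) source API
response = requests.get("https://api.example.com/orders", timeout=30)
response.raise_for_status()
raw_orders = pd.DataFrame(response.json())

# Transform: cleaning happens here, on a separate server, *before* loading
orders = (
    raw_orders
    .dropna(subset=["order_id"])                                  # discard incomplete rows
    .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
    .assign(net_value=lambda df: df["gross_value"] - df["discount"])
    [["order_id", "order_date", "net_value"]]                     # only load what is needed
)

# Load: write the already-transformed data to the warehouse
engine = create_engine("postgresql://user:password@warehouse-host/analytics")
orders.to_sql("fact_orders", engine, schema="reporting", if_exists="append", index=False)
```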
ELT, or (you guessed it…) “Extract, Load, Transform” is a more recent innovation. The change is more than a slight rearrangement of letters however – it describes extracting raw data and loading that straight into a data warehouse with no (or very little) transformation. Once the data is in, then data transformation can be performed using the resources in the data warehouse itself. This transformation can even happen while the data is being loaded.
Simply put, ELT means the same tool that’s used to store the data is also the tool used to transform it.
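For contrast, here is a minimal ELT sketch using Google BigQuery’s Python client: the raw extract is loaded untouched, and the transformation is expressed as SQL run inside the warehouse. The dataset and table names are made up for illustration, and the raw file stands in for whatever your extraction step produces.

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# Extract + Load: land the raw data in the warehouse untouched
raw_orders = pd.read_csv("orders_export.csv")  # raw extract from the source system
client.load_table_from_dataframe(raw_orders, "raw_data.orders").result()

# Transform: the heavy lifting happens inside the warehouse, in SQL
client.query("""
    CREATE OR REPLACE TABLE reporting.fact_orders AS
    SELECT
        order_id,
        DATE(order_date)       AS order_date,
        gross_value - discount AS net_value
    FROM raw_data.orders
    WHERE order_id IS NOT NULL
""").result()
```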
ETL workloads are usually more compute-intensive (the transformation happens on a dedicated server), whereas ELT workloads are more storage-intensive (raw data is landed first and transformed in place). In fact, the switch from ETL to ELT is, at its core, a move from compute to storage.
What has brought about this change?
The key change that has allowed ELT to become widespread is the emergence of the cloud data warehouse. Ubiquitous tools like Google BigQuery, Snowflake, AWS Redshift, Azure Synapse Analytics and Delta Lake provide managed services that require little maintenance. They offer decent support for different underlying file types and data structures, and one can store vast amounts of data at relatively low cost.
It’s not just the storage that’s cheap though; the “transformation” layer in these tools is also very performant, thanks to column-oriented storage and the design of the warehouses’ optimisation engines. The engines behind them, such as Google’s Dremel or Snowflake’s SQL query engine, are built for MPP (Massively Parallel Processing) and have interesting histories in their own right.
New tooling has also arisen in lockstep with the emergence of ELT workflows. Transformation tools such as dbt, and the rise of the “analytics engineer” role, have allowed data warehouse pipelines to be built using techniques one would find in modern software engineering projects: versioning, sophisticated testing suites and clean deployment environments. Data lakehouses provide another new architectural paradigm, marrying the lower cost and more flexible storage of data lakes with the analytical benefits of data warehouses.
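dbt expresses its transformations and tests in SQL and YAML rather than Python, but to give a flavour of what those automated test suites guard against, here is a rough Python sketch of two common checks (a not-null and a uniqueness test on a key column) run directly against a warehouse table. The connection string and table name are hypothetical.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host/analytics")

def assert_no_failures(check_name: str, sql: str) -> None:
    """Fail the pipeline if the check query finds any offending rows."""
    with engine.connect() as conn:
        failures = conn.execute(text(sql)).scalar()
    if failures:
        raise AssertionError(f"{check_name}: {failures} failing rows")

# Similar in spirit to dbt's not_null and unique schema tests
assert_no_failures(
    "fact_orders.order_id not_null",
    "SELECT COUNT(*) FROM reporting.fact_orders WHERE order_id IS NULL",
)
assert_no_failures(
    "fact_orders.order_id unique",
    """SELECT COUNT(*) FROM (
           SELECT order_id FROM reporting.fact_orders
           GROUP BY order_id HAVING COUNT(*) > 1
       ) AS dupes""",
)
```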
One other interesting variation on ETL is “TEL” (Transform, Extract, Load). This can be used when it’s important to keep a record, in the source system itself, of how data has been extracted from it. The main example here is blockchain technology: migrating data from one blockchain to another needs a record of the transformation applied, so data could be transformed on the source blockchain before it is extracted and loaded to the target blockchain.
ETL vs ELT
There are a lot of blogs out there on this topic, often written by vendors whose tools are designed around either ETL or ELT. Data integration services might tell you ETL is still king, whereas tools built on cloud data warehouses might tell you to make the switch to ELT.
ELT has some pretty obvious advantages:
- It simplifies the tech stack, as there is no need for a separate transformation server, which often reduces cost
- Performance is usually better, as data is loaded and transformed more quickly
- The underlying infrastructure scales automatically as your data throughput changes
- Development of pipelines is often quicker
In addition to the above, having a separate transformation layer can “tidy” away some of the subtleties in the source data. Indeed, in our experience, getting your hands dirty with the source data, in the same system as your prepared data sets, makes it that bit easier to reason about the source data itself.
There are still a number of reasons why ETL would be a preferred approach, however.
- It allows unwanted data, such as PII, to be removed for compliance reasons before it is ever loaded into the warehouse
- ETL has more flexibility with choice of tooling – you are not “tied in” to one product
- ELT only really makes sense with a cloud data warehouse or data lake; those with legacy or custom-built architectures may find ELT infeasible
Which is better, ETL or ELT?
Although these two categories (ETL vs ELT) are useful in distinguishing approaches, as with almost everything in life, the distinction is in reality a lot more blurred. It’s more than reasonable to have an “ETLT” process, where a few minor transformations, such as column restriction and masking of sensitive data, are performed in the same process that extracts the data (say, a script that extracts source data from an external API). Once that data is loaded, the vast majority of your transformations are still performed in the data warehouse, à la ELT. Moreover, modern data pipelines draw on a wide variety of data sources and platforms, where a mixture of both ETL and ELT is required.
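As a hedged sketch of what such an “ETLT” process might look like, the snippet below restricts columns and hashes an email field during extraction, then leaves the substantive modelling to SQL inside the warehouse. The API endpoint, fields, dataset and table names are all illustrative.

```python
import hashlib
import pandas as pd
import requests
from google.cloud import bigquery

# Extract from a (hypothetical) external API
customers = pd.DataFrame(
    requests.get("https://api.example.com/customers", timeout=30).json()
)

# Light "t": restrict columns and mask PII before the data ever lands
customers = customers[["customer_id", "email", "country", "signup_date"]]
customers["email"] = customers["email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest()  # one-way hash, no raw emails stored
)

# Load the lightly-treated data, then do the real "T" in the warehouse as usual
client = bigquery.Client()
client.load_table_from_dataframe(customers, "raw_data.customers").result()
client.query("""
    CREATE OR REPLACE TABLE reporting.dim_customer AS
    SELECT customer_id, country, DATE(signup_date) AS signup_date
    FROM raw_data.customers
""").result()
```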
However, a very simplistic guide is as follows. If you:
- Are using a cloud data warehouse
- Are happy storing raw source data in your warehouse
Then give ELT a try – you may find that it helps you speed up your delivery of data pipeline projects for those data-hungry stakeholders.
About the author
Gareth Case
Gareth joined Lynchpin in 2021, coming from a background in Financial Services.
Gareth has experience in building out data warehousing and ETL solutions, for both on-premises and cloud platforms, having worked with Microsoft Azure and Google Cloud Platform.
Gareth also has experience in digital measurement including Adobe Analytics and Google Analytics, along with implementation via tag management systems.
Gareth holds a PhD in Mathematics and loves to automate everything he can!