ETL vs. ELT is a Red Herring

Anthony Gatti
7 min readDec 1, 2022

--

These thoughts are my own and do not represent my current or former employers.

If you live in the data world, you’ve probably seen plenty of posts, blogs, architecture diagrams, and vendor white-papers about the difference between ETL and ELT. Usually, ELT is presented as the more “modern” approach best suited for the public cloud. I’d like to present that these acronyms and their associated baggage are not broad enough to explain the architectures both in use and available today.

The internet doesn’t need another blog explaining these acronyms and what they mean, but for my purposes it’s worth providing a basic explanation:

  • ETL is the process of extracting data from multiple sources, transforming it via aggregations, joins, standardizations, etc, and loading it into one or more target systems.
  • ELT is the process of extracting data from multiple sources, loading it into a target system or platform, and transforming it on that platform. Doing the transformation on-platform utilizes that platform’s compute engine, usually via a SQL interface.

I believe that presenting this as an either-or is shortsighted. There’s no reason why both patterns can’t be used even together in the same architecture. We might as well have another acronym called ETLT; being of a certain age, I find myself reminded of this:

This post aims to dig into the patterns and present an option for how to think about the architecture of data integration for your use case that yields maximum flexibility for your business needs. When we propose narrow mental frameworks as “best practices,” we can often perpetuate the data plumbing disasters that we are all so familiar with.

Going forward, I think we should use the terms ingestion and transformation to explain what’s going on in our data architectures. Here’s why.

Data Integration Then and Now

ETL was first implemented as a pattern well before the days of the public cloud. The idea behind the pattern and associated products was to standardize the process of loading data from multiple sources into one or more targets, with transformations built into the pipeline. In this context, both the sources and the targets were on your premises. Even if you were getting data from external sources, the common pattern was to use protocols like FTP to “land” data in the front door (often people call this the “demilitarized zone” or DMZ — another awesome technology acronym).

ETL tools in the pre-public cloud era were designed primarily as appliances that you would run on a collection of servers also running in your environment (or, at a smaller scale, as clients on your machine).

The traditional ETL architecture

In the modern age, most enterprises are taking their gobs of data from a variety of different sources spread across both their premises and external systems and moving the data to the public cloud. Many tools and products that load data into target systems (e.g. Fivetran, Airbyte) are designed to excel at picking data up and dumping it somewhere else, without much, if any, logic in between. Then the transformation happens in the target system.

For example, an enterprise might load their data from Oracle, mainframes, flat files, Salesforce, and Workday into Snowflake. Then they’ll do all the standardization, aggregation, quality checks, joins, and “denormalizing” to prepare data for operational analytics all in Snowflake, via SQL or a graphical tool on top of it.

The ELT architecture with the public cloud (note I’m making a simplification in terms of defining transformation as table to table, when it could in fact be object to table)

The “extract & load” part of the ELT process is often combined together into a single term: ingestion. This is a good term for explaining what’s happening, but when used to refer solely to the notion of picking up data and dumping it somewhere else without doing anything in between, it narrows the term, I think, unnecessarily.

What happens if you want, or need, to do some transformation on data before you load it into the target system? That would look like this:

The first part is ingestion; data is being extracted from multiple sources, transformed on the way, and then loaded into the target system. That is to say — it’s ETL. The transformation on the other side — that’s the T in ELT. So we have arrived at ETLT, also known as “leedle leedle leedle lee.”

Why does it matter?

Some readers might object that this first part is “etl” but it’s not capital E capital T capital L, i.e. the architectural pattern plus the best practices, standards, tools, and job titles associated with implementing the pattern (I call this ETL++). But that’s precisely the problem I want to address: do we need to do the entire ETL++ process in the ingestion phase, or can we back up and view this as an end-to-end data integration process? If so, what’s the point of using the ETL and ELT terms?

It’s important to realize that the ability to do ETL as part of your ingestion process is an option. The cloud platform vendors, and probably some systems integrators, would like you to believe that this is an anti-pattern; after all, you’ve bought a brand new, shiny, super powerful race car, so you might as well use it to its full potential. And this is true; there’s a lot to be gained from centralizing your data onto a data platform and pushing most of your compute onto (or “down”) to that platform. You get the most of the platform that you’ve likely made a big investment in, and they are committed to providing you the best performance, along with centralized governance, security, lineage, policies, access control, etc.

But there are cons to ingesting all your data raw into a data platform and then processing it exclusively on that platform.

  • It can be very expensive to do so; you will end up using most, if not all, of your compute credits prepping data for its ultimate analytical use case — and do you even know what that use case is yet?
  • It requires extensive organizational coordination to combine data sources and line them up to achieve the purported nirvana of the “data platform” (I’ll tackle the “is this a data lake or warehouse or lakehouse or bungalow” question another time). The organizational cost is large and often underestimated. It’s the same problem that many companies ran into with Hadoop — taking all your data and “dumping” it onto a platform could very well be creating the proverbial “data swamp.” If you just dump the data in the target system and worry about data quality, standards, and access controls later, have you come any closer to solving your business problems?
  • Perhaps most importantly, depending on your industry, some data might end up on your data platform that you really don’t want to be there.

I’m not here to tell you how you should use your data platform for your business use case. What I have experienced, however, in many different contexts, is the confusion that people have when they adopt a new data platform about what pattern they should follow. There is no one right way, and the options are more varied than one would think initially. That’s why I think, if you’re an enterprise, it’s important to pick a toolset and architecture that can handle multiple patterns and workloads.

What about Reverse ETL?

Ah, yes, another exceptional term: reverse ETL. This is the idea and pattern that platforms like Snowflake, Databricks, and Redshift can serve as the source of data that needs offloaded to a target operational system. For example, Snowflake might serve as the central store of all your enterprise data, but you might need to take that data and offload it back to Salesforce after you’ve enriched the data with information from other systems.

(By the way, if we add L, perhaps with a T before it, to the end of ETLT we’d have ETLTTL, which is even closer to my preferred “leedle leedle leedle lee.”)

I have no problems with the term “reverse ETL” or with the architectural pattern. But as the cloud data platforms evolve to support workloads beyond analytics (Snowflake is certainly heading in that direction), I don’t see any reason why we can’t just call this “ingestion.” Pedantically, perhaps “ingestion” really only highlights something going down the hatch, not coming out the other end; but is it really any different? Are the tools and processes required to successfully build reverse ETL workloads any different than regular EL workloads? Perhaps the answer is yes technologically speaking, but from a product and market perspective, another layer of confusion is not warranted here in my opinion.

In Conclusion — There Are Options

In the end my motivation for writing this is to address anyone in the data community who is unsure about what the right path is for their business and who is confused by the inundation of terms, tools, and platforms. The data landscape, and broader technology landscape, is always and will always be changing, and the tools are advancing, but the problem remains the same: how can we provide value to our business by unlocking insights from data? There is no one pattern that must be followed, and there is more than one way to get the job done. It’s time our “best practices” and industry lingo evolve to reflect that more fully.

--

--

Anthony Gatti

I’m a product manager at a technology startup. I write about lessons I’ve learned about PM and building successful organizations.