This article was written by Rahul Pathak, vice president of relational database engines at AWS.
Integrating data across an organization can give you a better picture of your customers, streamline your operations, and help teams make better, faster decisions. But integrating data isn’t easy.
Organizations often gather data from different sources, using a variety of tools and systems such as data ingestion services. That data tends to be stored in silos, which means it has to be moved into a data lake or data warehouse before analytics, artificial intelligence (AI), or machine learning (ML) workloads can be run. And before that data is ready for analysis, it needs to be combined, cleaned, and normalized, a process otherwise known as extract, transform, load (ETL), which can be laborious and error-prone.
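To make the combine, clean, and normalize steps concrete, here is a minimal sketch of a hand-written ETL job in Python. Everything in it is an assumption made for illustration: the two in-memory "sources", the field names, and the `customers` table are hypothetical, and a real pipeline would read from separate systems rather than from lists.

```python
import sqlite3

# Hypothetical records from two siloed sources with inconsistent formats.
crm_rows = [
    {"email": " Ana@Example.com ", "spend_usd": "120.50"},
    {"email": "bo@example.com", "spend_usd": "80"},
]
web_rows = [
    {"Email": "ana@example.com", "visits": 7},
    {"Email": "cy@example.com", "visits": 2},
]


def extract():
    """Extract: stand-in for reads from two separate systems."""
    return crm_rows, web_rows


def transform(crm, web):
    """Transform: clean keys, normalize types, and combine on email."""
    merged = {}
    for row in crm:
        email = row["email"].strip().lower()
        record = merged.setdefault(email, {"spend_usd": 0.0, "visits": 0})
        record["spend_usd"] += float(row["spend_usd"])
    for row in web:
        # Note the differently cased field name in this source.
        email = row["Email"].strip().lower()
        record = merged.setdefault(email, {"spend_usd": 0.0, "visits": 0})
        record["visits"] += int(row["visits"])
    return [(e, r["spend_usd"], r["visits"]) for e, r in sorted(merged.items())]


def load(rows, conn):
    """Load: write the combined rows into the analytics store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(email TEXT PRIMARY KEY, spend_usd REAL, visits INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    crm, web = extract()
    rows = transform(crm, web)
    load(rows, conn)
    return rows
```

Even in this toy version, the transform step is tied to the exact field names of each source: if either source renamed a field, the job would raise a `KeyError` until someone updated the code by hand.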
At AWS, our goal is to make it easier for organizations to connect to all of their data, and to do it with the speed and agility our customers need. We’ve developed our pioneering approach to a zero-ETL future based on these goals: Break down data silos, make data integration easier, and increase the pace of your data-driven innovation.
The problem with ETL
Combining data from different sources can be like moving a pile of gravel from one place to another: it's difficult, time-consuming, and often unsatisfying work. First, ETL frequently requires data engineers to write custom code. Then, DevOps engineers or IT administrators have to deploy and manage the infrastructure to make sure the data pipelines scale. And when the data sources change, the data engineers have to manually change their code and deploy it again.
Furthermore, when data engineers run into issues, such as data replication lag, breaking schema updates, and data inconsistency between the sources and destinations, they have to spend time and resources debugging and repairing the data pipelines. While the data is being prepared—a process that can take days—data analysts can’t run interactive analyses or build dashboards, data scientists can’t build ML models or run predictions, and end users, such as supply chain managers, can’t make data-driven decisions.
This lengthy process kills the opportunity for any real-time use cases, such as assigning drivers to routes based on traffic conditions, placing online ads, or…