Extract
The first phase of an ETL process involves extracting data from different source systems and storing it in a consolidated staging layer that is easier for transformation tools to access. Data sources are typically heterogeneous and come in a wide variety of data formats. The staging layer can be transient, which cuts back on storage demands and avoids retaining any personally identifiable information (PII) present in the source systems, or persistent when no PII is present and storage is not a concern. Either way, the staging layer is typically stored as files in an object store such as Azure Blob Storage or ADLS, as in the sketch below.
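As an illustration, here is a minimal sketch of landing one raw extract as a date-partitioned file in Blob Storage with the azure-storage-blob SDK. The container name, path convention, and connection-string environment variable are assumptions made for this example, not prescriptions from the tooling.

```python
# Minimal sketch: write one raw extract to a staging container in
# Azure Blob Storage. Container name, path layout, and the environment
# variable holding the connection string are illustrative assumptions.
import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

def stage_raw_extract(payload: bytes, source_name: str) -> str:
    """Land a raw extract at a date-partitioned path and return that path."""
    service = BlobServiceClient.from_connection_string(
        os.environ["STAGING_STORAGE_CONNECTION_STRING"]  # assumed setting
    )
    # Date partitioning makes it easy for transformation jobs to scope a run.
    now = datetime.now(timezone.utc)
    blob_path = f"{source_name}/{now:%Y/%m/%d}/extract_{now:%H%M%S}.json"
    blob = service.get_blob_client(container="staging", blob=blob_path)
    blob.upload_blob(payload, overwrite=True)
    return blob_path
```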
In Azure, tools such as Azure Logic Apps and ADF give data engineers a graphical user interface (GUI) for dragging and dropping activities that copy data from source systems and land it in the staging layer. These activities can be parameterized to dynamically adjust where the raw data is staged. Custom code options such as Azure Databricks and Azure Functions are also available to extract data with languages such as C#, Python, Scala, and JavaScript. By their nature, these custom code options give data engineers more control over how extracted data is formatted and staged. Regardless of whether data extraction is done with a GUI-based or code-first tool, extraction activities can be automated to run on a schedule or be event driven, firing when new data is added to the source system; a scheduled code-first example follows.
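As a sketch of the scheduled, code-first option, the Azure Functions Python v2 programming model lets a timer trigger kick off an extraction run. The CRON schedule and the extract_and_stage helper below are hypothetical placeholders for whatever source pull the pipeline actually needs; the same function body could instead be bound to a blob or Event Grid trigger to make it event driven.

```python
# Sketch: a scheduled extraction using the Azure Functions Python v2
# programming model. The schedule and helper are illustrative assumptions.
import logging

import azure.functions as func

app = func.FunctionApp()

def extract_and_stage() -> None:
    """Hypothetical placeholder for the source pull and the staging write."""
    ...

@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")
def nightly_extract(timer: func.TimerRequest) -> None:
    # Fires at 02:00 UTC daily per the six-field CRON expression above.
    logging.info("Starting scheduled extraction run")
    extract_and_stage()
```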
Data can be extracted from a source system in a few different ways. Incremental extractions pull only the source data that has been inserted or updated since the last run. This minimizes both the time to extract the necessary source data and the time to transform the new raw records, but it requires additional logic to determine which data has changed; a common approach is to track a high-watermark value, as in the sketch below. For systems that cannot identify which records have changed, a full extraction must take place, copying the entire source dataset on every run. While that yields an accurate copy of the source data, it takes longer to extract, and subsequent transformation activities take longer to run.
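Here is a minimal sketch of that watermark logic using pyodbc. The table name, LastModified column, and the idea of persisting the watermark between runs are assumptions for illustration; the technique only works when the source system exposes a reliable change-tracking column.

```python
# Sketch: watermark-based incremental extraction. Table and column names
# are illustrative; the source must expose a reliable last-modified column.
import pyodbc

def extract_incremental(conn_str: str, last_watermark: str):
    """Pull only rows changed since the last run; return them plus the new watermark."""
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT * FROM dbo.Orders "     # hypothetical source table
            "WHERE LastModified > ? "
            "ORDER BY LastModified",
            last_watermark,
        )
        rows = cursor.fetchall()
    # Persist the highest LastModified seen so the next run starts from it.
    new_watermark = str(rows[-1].LastModified) if rows else last_watermark
    return rows, new_watermark
```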