Augmented Data Lineage: What It Is, and Why It Matters
In recent years, data lineage has been a highly sought-after capability for data management and data governance teams. By now, it has become a critical feature of data catalogs and metadata management solutions, offering a wide range of benefits and applications. These include regulatory compliance, impact analysis, and a faster understanding of the enterprise data landscape.
Typically, data lineage is associated with technical roles, such as ETL developers and data engineers. However, when data lineage is enriched with business metadata, it can become a particularly useful and practical capability for business users.
In this post, we’ll introduce the concept of augmented data lineage as a tool for business users. We will explore how business and analytical roles within enterprises can use it to find data and perform root cause analyses faster while avoiding corporate red tape.
What is Augmented Data Lineage?
Augmented data lineage is “regular” data lineage enriched with metadata from a data catalog, such as real-time data quality, business terms and categories, and anomalies detected in data loads.
Enhanced with this information, data lineage can speed up the process of locating the right data or support analytical activities, such as root cause analysis or data quality analysis. The visual presentation of augmented data lineage alone makes a big difference in a user's ability to draw conclusions, as opposed to just viewing a list of data sets on the catalog's search results page.
Data lineage enhanced with business terms
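To make the idea concrete, here is a minimal sketch of what an enriched lineage node could look like as a data structure. The class names, field names, and the 0.0 to 1.0 data quality scale are assumptions made for illustration; they do not describe any particular product's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Asset:
    """One data set in the lineage graph, with metadata taken from the catalog."""
    name: str
    business_terms: set = field(default_factory=set)
    dq_score: Optional[float] = None  # assumed scale: share of valid values, 0.0 to 1.0
    anomalies: list = field(default_factory=list)


@dataclass
class LineageGraph:
    assets: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)  # (source name, target name) pairs

    def add(self, asset: Asset) -> None:
        self.assets[asset.name] = asset

    def link(self, source: str, target: str) -> None:
        # "Regular" lineage: the target data set is derived from the source.
        self.edges.add((source, target))

    def label(self, name: str) -> str:
        """Render the text a lineage view might show on a node."""
        asset = self.assets[name]
        dq = "n/a" if asset.dq_score is None else f"{asset.dq_score:.0%}"
        terms = ", ".join(sorted(asset.business_terms)) or "none"
        flag = " [anomaly]" if asset.anomalies else ""
        return f"{asset.name} | DQ {dq} | terms: {terms}{flag}"


# Hypothetical data sets, for illustration only.
graph = LineageGraph()
graph.add(Asset("crm.customers", {"birthdate", "email"}, 0.90))
graph.add(Asset("dwh.customers_clean", {"birthdate", "email"}, 0.99,
                ["record count dropped"]))
graph.link("crm.customers", "dwh.customers_clean")

for name in graph.assets:
    print(graph.label(name))
```

The edges alone are plain technical lineage; the business terms, quality score, and anomaly list on each node are the "augmented" part.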
This enriched data lineage can help answer many questions that are typically addressed with a data catalog search query or by consulting standard data lineage:
- Is this the best data I can use for my data science project or analytic assignment?
- Has this report been generated from valid and timely data?
- Why does a metric in a report contain an unexpectedly large or small value?
- Which data sets contain PII data, and in which systems do they originate?
Let’s examine how these questions can be answered by using augmented data lineage.
Finding Data with Lineage
Imagine that you need to analyze customer birthdates in order to predict purchases of a product by age bucket. You have a data catalog in place, so you search for “customer birthdates” and find a seemingly relevant data set. You get an overview of the whole data set, including assessments of data quality dimensions such as validity, consistency, and completeness. However, looking at the frequency analysis, a data quality issue catches your eye: around 10% of the values are empty or obviously invalid (“NULL” and “N/A” values).
Data asset detail in the data catalog
What do you do now? Go back to the search results and look at another data set with birthdates? How many times will you have to go back and forth like that? Suddenly, what seemed like a straightforward step in the process has become a time-consuming and frustrating endeavor. This is where augmented data lineage can be a game changer.
In the augmented business lineage view, you will immediately see all previous and subsequent transformations of the data set. If someone has already cleansed or prepared a better version of this data set, enriched with data from a different source, it will pop up on your screen in an easy-to-read and easy-to-trace format.
Thanks to business terms and DQ indicators, you will see to what extent the related assets are relevant and usable. Then, you can make a more informed decision about whether to use the current data set or a transformed data set with better quality, or even combinations of data sets from different points in the transformation lifecycle.
Data lineage enhanced with business terms
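The search for a better version of a data set can be sketched as a walk over the lineage graph in both directions, keeping only related data sets that carry the same business term and have a higher quality score. The data set names, the metadata layout, and the scores below are hypothetical and exist only to illustrate the idea.

```python
from collections import deque

# Hypothetical lineage edges: (source data set, derived data set).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "dwh.customers_clean"),
    ("dwh.customers_clean", "mart.customer_360"),
    ("staging.customers", "mart.orders_enriched"),
]

# Hypothetical catalog metadata: business terms and a DQ score from 0.0 to 1.0.
METADATA = {
    "crm.customers": {"terms": {"birthdate", "email"}, "dq": 0.90},
    "staging.customers": {"terms": {"birthdate", "email"}, "dq": 0.90},
    "dwh.customers_clean": {"terms": {"birthdate", "email"}, "dq": 0.99},
    "mart.customer_360": {"terms": {"birthdate"}, "dq": 0.98},
    "mart.orders_enriched": {"terms": {"order id"}, "dq": 0.97},
}


def reachable(start, adjacency):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def better_alternatives(start, term):
    """Upstream or downstream data sets with `term` and a higher DQ score."""
    downstream, upstream = {}, {}
    for source, target in EDGES:
        downstream.setdefault(source, []).append(target)
        upstream.setdefault(target, []).append(source)
    related = reachable(start, upstream) | reachable(start, downstream)
    baseline = METADATA[start]["dq"]
    hits = [
        (name, METADATA[name]["dq"])
        for name in related
        if term in METADATA[name]["terms"] and METADATA[name]["dq"] > baseline
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(better_alternatives("staging.customers", "birthdate"))
```

Walking upstream and downstream separately, rather than treating the graph as undirected, keeps the result limited to true predecessors and successors of the starting data set.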
However, what if the lineage for this particular data set does not reveal any other promising data sets? Read on to learn about an alternative lineage view.
Broadening the Context with Business Term Lineage
In the example above, the user is interested in very specific data: customer birthdates. Such columns will have a proper business term assigned, something like “birthdate” or “date of birth”. With Ataccama, you can see the lineage for that particular glossary term and find the best-quality assets in every system. The example below shows lineage for PII data.
Business term lineage for the ‘PII’ term
Essentially, you get a map of all data sets that carry the term (birthdates, in our scenario), tracking their origin all the way to the source systems. This high-level view provides the full context around specific data and enables users such as analysts and data scientists to pick and choose from the relevant data sets.
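A rough sketch of how such a term-level view could be computed: collect every data set tagged with the term, follow its lineage upstream to the data sets that have no parents, and pick the highest-scoring data set per system. The catalog layout, system names, and scores are assumptions for illustration, not the way any specific tool stores this information.

```python
# Hypothetical catalog: data set -> (system, business terms, DQ score 0.0 to 1.0).
CATALOG = {
    "crm.customers": ("CRM", {"birthdate", "email"}, 0.90),
    "web.signups": ("Web", {"birthdate"}, 0.80),
    "dwh.customers_raw": ("DWH", {"birthdate", "email"}, 0.88),
    "dwh.customers_clean": ("DWH", {"birthdate", "email"}, 0.99),
    "dwh.orders": ("DWH", {"order id"}, 0.97),
}

# Hypothetical lineage: data set -> data sets it is derived from.
PARENTS = {
    "dwh.customers_raw": ["crm.customers", "web.signups"],
    "dwh.customers_clean": ["dwh.customers_raw"],
}


def origins(name):
    """Follow lineage upstream until reaching data sets with no parents."""
    parents = PARENTS.get(name, [])
    if not parents:
        return {name}
    found = set()
    for parent in parents:
        found |= origins(parent)
    return found


def term_lineage(term):
    """Map each data set carrying `term` to the source data sets it comes from."""
    return {
        name: sorted(origins(name))
        for name, (_system, terms, _dq) in CATALOG.items()
        if term in terms
    }


def best_per_system(term):
    """For each system, the data set carrying `term` with the highest DQ score."""
    best = {}
    for name, (system, terms, dq) in CATALOG.items():
        if term in terms and (system not in best or dq > best[system][1]):
            best[system] = (name, dq)
    return best


print(term_lineage("birthdate"))
print(best_per_system("birthdate"))
```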
Rapid Root Cause Analysis with Anomalies
One of the most well-known benefits of data lineage is that it allows users to perform root cause analyses. The story usually goes like this: the VP of Sales (or someone in a similar role) thinks that the numbers in the new quarterly report do not make sense. Your task is to find out why. Some would say that all you need to complete this task is access to data lineage, at which point you can immediately identify the data sets that caused the problem.
While it's true that having accurate lineage decreases the time needed to diagnose a problem, an even faster method is to use AI to automatically detect anomalies whenever new data is loaded.
Data lineage enhanced with anomaly detection
Anomaly detection works by comparing the previous and current versions of a data set and flagging notable changes in data characteristics, such as the value frequency distribution, minimum and maximum values, an unexpected record count, or inconsistent data formatting. As a result, anyone investigating an issue can immediately see where it first appeared, which SQL procedure or ETL transformation caused it, and which data sets were affected, including whether the data set used to produce a report has been corrupted. Based on that knowledge, they can take measures to prevent the issue from recurring.
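The comparison of two loads can be illustrated with a deliberately simple stand-in: profile each load, then flag characteristics that moved by more than a fixed tolerance. This rule-based sketch is not the machine learning approach the product uses; the function names, the tolerances, and the sample birth years are all assumptions chosen for illustration.

```python
from collections import Counter


def profile(values):
    """Summarize one load of a column: count, nulls, range, value frequencies."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "null_share": (len(values) - len(present)) / len(values) if values else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "freq": {k: n / len(present) for k, n in counts.items()},
    }


def detect_anomalies(prev, curr, count_tol=0.2, null_tol=0.05,
                     range_tol=0.1, freq_tol=0.2):
    """Return the characteristics that changed notably between two profiles."""
    findings = []
    if prev["count"] and abs(curr["count"] - prev["count"]) / prev["count"] > count_tol:
        findings.append("record count")
    if abs(curr["null_share"] - prev["null_share"]) > null_tol:
        findings.append("null share")
    if None not in (prev["min"], prev["max"], curr["min"], curr["max"]):
        # Allow the range to grow by a fraction of the previous span.
        slack = range_tol * (prev["max"] - prev["min"])
        if curr["min"] < prev["min"] - slack or curr["max"] > prev["max"] + slack:
            findings.append("value range")
    # Total variation distance between the two frequency distributions.
    keys = set(prev["freq"]) | set(curr["freq"])
    distance = 0.5 * sum(
        abs(prev["freq"].get(k, 0.0) - curr["freq"].get(k, 0.0)) for k in keys
    )
    if distance > freq_tol:
        findings.append("value distribution")
    return findings


# Hypothetical birth years from two consecutive loads.
previous_load = [1980, 1985, 1990, 1990, 1975, 2000, 1995, 1988, 1992, 1983]
current_load = [1980, 1985, 1990, None, None, 1900, 1900, 1900, 1992, 1983]

print(detect_anomalies(profile(previous_load), profile(current_load)))
```

In this made-up example the record count is unchanged, but the new load introduces nulls, a cluster of implausible 1900 values, and a shifted distribution, so three of the four checks fire.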
The key feature of anomaly detection is that it is completely automatic and needs no configuration since it is powered by machine learning.
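Once loads are flagged automatically, the root cause search from the quarterly report scenario reduces to a walk upstream through the lineage. The sketch below finds flagged data sets that have no flagged ancestors, which is where a problem first appears. The data set names and the set of flagged assets are hypothetical.

```python
# Hypothetical lineage: data set -> data sets it is derived from.
PARENTS = {
    "report.quarterly_sales": ["dwh.sales_agg"],
    "dwh.sales_agg": ["staging.orders", "staging.customers"],
    "staging.orders": ["erp.orders"],
    "staging.customers": ["crm.customers"],
}

# Hypothetical result of anomaly detection on the latest loads.
ANOMALOUS = {"staging.orders", "dwh.sales_agg", "report.quarterly_sales"}


def upstream(name):
    """Every data set that `name` is directly or indirectly derived from."""
    found = set()
    stack = list(PARENTS.get(name, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS.get(node, []))
    return found


def root_causes(target):
    """Flagged data sets at or above `target` that have no flagged ancestors."""
    candidates = (upstream(target) | {target}) & ANOMALOUS
    return sorted(c for c in candidates if not (upstream(c) & ANOMALOUS))


print(root_causes("report.quarterly_sales"))
```

Here the only root cause is `staging.orders`: its own source, `erp.orders`, is clean, which points at the transformation that loads `staging.orders` as the place to investigate.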
In conclusion, augmented data lineage is an enhanced version of technical data lineage, enriched with business metadata such as business terms, data quality, and AI-detected anomalies. With its help, data scientists, data analysts, data stewards, and other users can find the data they need more quickly, perform root cause analyses when data quality declines, or analyze how data quality changes between data sources and data consumption points (data lakes or data warehouses). In these ways, augmented data lineage extends the utility of data lineage to a wider circle of users, giving them an alternative, and often faster, way of solving known problems.