How well do you understand the state of your data? Chances are not so well, and the bigger your organization is, the higher those chances are. While there is no such thing as a “typical data architecture,” most include several tools spread across the layers of:
- Source data
- Data warehousing or data lakes
- Analytics and data consumption
For example, one of our clients, a large U.S. telco, has 56,000 databases. When you have to manage data at that scale, how do you do that effectively? Luckily, there is an answer: data observability.
What is data observability?
Data observability is an emerging concept, and it’s important to not only give it a proper definition but also to have the right mindset about it.
The term observability comes from systems theory:
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs… In software systems, observability is the ability to collect data about program execution, internal states of modules, and communication between components.”
What is data observability, then?
It’s your ability to understand the state of your data based on the information you’re collecting—or signals you’re receiving—about that data, such as data quality issues, anomalies, or schema changes. More on this later in the article.
In other words, the more information you collect about your data, the better your observability of that data, at least in theory.
Why is data observability important?
Data observability makes your data people's lives easier. Several factors make data observability important now (and probably will make it even more relevant in the future):
- Complex data landscapes within large organizations. Large enterprises with advanced data analytics functions usually have expansive and complex data landscapes with many interconnected components. These organizations need observability to keep track of all these components and prevent data quality issues from spreading to the numerous downstream systems.
- Data democratization movement. In distributed environments with strong data democratization cultures (for example, organizations with an established data mesh practice), teams need a simple way to track the state of data they are responsible for without necessarily focusing on “the whole pipeline."
- Smaller organizations need a simple way to get started with data quality. According to a recent survey, 97% of organizations consider data quality somewhat or extremely important. Data quality is no longer a priority only for large organizations in regulated industries or those with massive customer bases. Just like data-enabled teams within large organizations, smaller organizations need an easy way to get started with data discovery and quality, and data observability solutions provide this opportunity.
As organizations expand, data observability becomes even more crucial. Data pipelines grow more complex and are liable to break or experience issues. Larger data systems can have trouble due to challenging coordination, losses in communication, or conflicting changes made by multiple users. Having a precise pulse on everything happening can prevent these problems before they occur and make them easier to solve when they do.
Finding the source of data issues without data observability is much more challenging. Data engineers are forced to firefight broken data pipelines as they appear. Without context, fixing these problems can be like "looking for a needle in a haystack."
Speaking of broken pipelines, data observability makes them less likely to happen in general. This can prevent a long series of business and technical issues, including:
- Bad customer experience and loss of trust
- Reduced team morale
- Lowered productivity
- Compliance risk
- Missed revenue opportunities
Goals of data observability
To truly understand its importance, let's look at the goals of data observability. Any data observability solution seeks to achieve a state where:
- Data people can quickly identify and solve data problems
- Experimentation can be done to improve and scale data systems
- Data pipelines can be optimized to meet business requirements and strategies.
Data observability vs. data monitoring
Data observability has often been dubbed "monitoring-on-steroids." However, it doesn't substitute for monitoring, eliminate the need for it, or even “take it to a different level.” Instead, monitoring contributes to better data observability and is an integral part of the data observability process.
As we’ve mentioned, achieving data observability requires collecting much more information, which produces a deeper understanding of the relationship between data systems and uncovers actionable insights into the health of your system overall.
To summarize, monitoring is just one part of the tool stack that generates insights about your data and helps you achieve better data observability. For example, monitoring may alert a team to an issue, but without good data observability, finding its root cause will be very hard.
The components of data observability
Now that we know monitoring is a part of data observability, what other tools does it require? To excel at data observability, you'll need to collect as much information as possible about what’s happening within your data platform. Some tools that will help you obtain that information are:
- Data quality monitoring. Regularly checking data against pre-defined data quality rules and presenting results and trends.
- AI-powered anomaly detection. Some data quality issues are unpredictable and need AI to scan for them. For example, sudden drops in data volumes or average values in a specific column.
- Data discovery. Monitor and track data issues and understand what data appears in your critical systems (especially if you work with a lot of Personally Identifiable Information). Continuous data discovery within an automated data catalog (including a business glossary) also plays a critical role in data quality monitoring.
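As a minimal sketch of the first component, rule-based data quality monitoring amounts to running predefined checks against a dataset and reporting pass rates. The rule names, regex, and sample data below are illustrative, not any product's actual rule library:

```python
import re

# Hypothetical reusable data quality rules: name -> predicate on a single value
RULES = {
    "email_is_valid": lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "value_is_present": lambda v: v is not None and v != "",
}

def check_column(values, rule_name):
    """Apply one rule to every value in a column and return the pass rate (0.0-1.0)."""
    rule = RULES[rule_name]
    passed = sum(1 for v in values if rule(v))
    return passed / len(values) if values else 1.0

# Example: monitor the "email" column of an illustrative customer table
emails = ["a@example.com", "bad-email", None, "b@example.org"]
print(check_column(emails, "email_is_valid"))    # 0.5
print(check_column(emails, "value_is_present"))  # 0.75
```

Running checks like these on a schedule and tracking the pass rates over time is what produces the trends mentioned above.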
By applying these processes, you will get a 360° view of your data and achieve good data observability. Here is the information you will have:
- Data quality information. As measured by data’s validity, completeness, accuracy, and other dimensions of data quality.
- Schema changes. Data architecture changes in source systems can break reports and affect consuming systems, so checking for removed data attributes or changed attribute types is important.
- Anomalies in various data metrics:
- Changes in data volume
- Statistical metrics such as minimum, maximum, averages, or value frequencies
- Changes in business domains of data, such as customer data, product data, transactional data, reference data, etc., or specific business data types, such as “first name,” “last name,” “email address,” “product code,” “transaction id,” “gender,” “ISO currency code.”
- Data lineage. It is crucial to understand the context of every issue or alert. If the issue occurs early in the data pipeline (at or close to a source system), you’ll use data lineage to understand what consumers and reports are affected. If you spot an issue at the report or downstream system level, you will use data lineage to find the root cause of the issue.
- Any other metadata coming from ETL processes.
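To make the volume-anomaly signal concrete, here is a minimal sketch using a simple z-score over historical daily row counts. Real anomaly detection is typically more sophisticated (seasonality, learned models); the threshold and sample counts are illustrative assumptions:

```python
from statistics import mean, stdev

def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count as anomalous if it deviates from the
    historical mean by more than `threshold` standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# Daily row counts for an illustrative table over the past week
history = [10_120, 9_980, 10_050, 10_200, 9_940, 10_080, 10_010]
print(volume_anomaly(history, 10_100))  # False: within the normal range
print(volume_anomaly(history, 2_300))   # True: sudden drop in volume
```

A sudden drop like the second example is exactly the kind of signal that should trigger an alert before downstream reports are affected.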
How data observability works
Just like with software observability, you need tools for data observability. Otherwise, you will have to build the solution in-house, which we don’t recommend because, in most cases, the maintenance and the time to develop new features is simply not worth it.
As proponents of AI and using metadata to automate data management, here is how we approach data observability at Ataccama. Two key principles are:
- The synergy between a data catalog, business glossary, and central rule library. By defining business terms in the business glossary and assigning data quality rules from the central rule library to them, we automate data quality monitoring.
- Applying AI to detect anomalies and suggest business terms (and training it to be more accurate).
Once these principles are in place, the following process takes place for every data source you add:
- The discovery process analyzes the data and detects business terms within specific attributes, such as "this data attribute contains an email address."
- Select the business terms you want to monitor. For example, “I want to monitor names, emails, social insurance numbers, etc.”
- Schedule data profiling runs. Choose how often you want to profile data. Profiling will detect anomalies, apply new business terms, apply DQ rules, and perform structure checks.
- Get alerts, improve the AI, consume statistics on the data observability dashboard, and analyze and fix issues.
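One of the structure checks mentioned in the profiling step can be sketched as a comparison of schema snapshots between two runs, flagging removed attributes and changed attribute types. The snapshot format and table below are hypothetical:

```python
def schema_changes(previous, current):
    """Compare two schema snapshots ({column: type}) and report
    removed columns and changed column types."""
    removed = [c for c in previous if c not in current]
    changed = [c for c in previous
               if c in current and previous[c] != current[c]]
    return {"removed": removed, "type_changed": changed}

# Illustrative snapshots of a source table between two profiling runs
before = {"id": "INT", "email": "VARCHAR", "signup_date": "DATE"}
after = {"id": "INT", "signup_date": "VARCHAR"}
print(schema_changes(before, after))
# {'removed': ['email'], 'type_changed': ['signup_date']}
```

Either finding here (a dropped column or a silently changed type) is the kind of schema change that breaks downstream reports.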
Essential features of data observability tools
We have established that to achieve good data observability, you need specific components that will generate the information for it. However, to make that information actionable, users need specific features to configure these components and easily consume their outputs. Here they are:
- Alerts. Relevant users (such as data engineers, data stewards, or members of the analytics team) should get alerts notifying them of issues or anomalies.
- Dashboards are essential for users responsible for observing the health of systems because they provide trends and summarize information at a glance. Experienced users will decide to investigate specific issues based on prior knowledge.
- Data lineage with data quality context. Users should be able to see the context of an issue within the data pipeline.
- Central rule library. Since data monitoring is an essential component of data observability, users should be able to create and update reusable data quality rules in a centralized, governed, collaborative environment.
- Easy, no-code setup. Setting up data observability shouldn't take an army of DBAs, IT admins, and data architects. At the very least, basic setup should be accessible to non-technical users.
- Customization. Each data system and use case is different. Basic setup should be easy, but it should also be possible to fine-tune various aspects of the data observability solution: the types of checks to be performed, anomaly detection sensitivity, and data quality thresholds.
- Collaboration. Data management is a team sport, so you should have a platform that not only alerts users of issues but also lets them assign, track, and resolve those issues.
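To make the customization point concrete, here is a minimal, hypothetical configuration for one monitored table. The field names are illustrative, not a real product's API; the point is that checks, sensitivity, and thresholds are all tunable per table:

```python
# Hypothetical per-table observability configuration
observability_config = {
    "table": "crm.customers",
    "checks": ["volume", "schema", "dq_rules"],  # which check types to run
    "anomaly_sensitivity": 3.0,                  # std deviations before alerting
    "dq_thresholds": {"email_is_valid": 0.95},   # minimum pass rate per rule
    "profile_schedule": "daily",
    "alert_recipients": ["marketing-data-team"],
}

def should_alert(config, rule_name, pass_rate):
    """Alert when a rule's pass rate falls below its configured threshold."""
    return pass_rate < config["dq_thresholds"].get(rule_name, 1.0)

print(should_alert(observability_config, "email_is_valid", 0.90))  # True
print(should_alert(observability_config, "email_is_valid", 0.99))  # False
```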
How to succeed with data observability
Now that you've learned what data observability is, how it works, and why it's important, you probably want to get started immediately! Before you do, read these tips to ensure you are on the right track.
Start small
We’ve learned many best practices from delivering hundreds of data quality projects, but there is one principle everyone mentions: “start small,” or “don’t boil the ocean.” It applies to getting started with data observability too.
Start with one system, team, or department, and test the functionality: how often to deliver alerts, what anomaly detection settings work best, or the optimal system scan frequency. Learn what works and take those best practices further. As you expand, you will also be able to reuse the data quality rules you have created and benefit from an AI already trained on your data.
Deliver alerts immediately to the right people
Alerts are beneficial only when delivered to the relevant people. If your organization assigns data ownership by domain (e.g., marketing team members own the marketing data), sending alerts only to the data engineering team won't help much. A single team or individual can be responsible for the tool itself, but the data teams and owners should receive all alerts about their data.
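This kind of domain-based routing can be sketched very simply: look up the owning team for the data domain and fall back to a central team only when no owner is registered. The routing table and addresses below are hypothetical:

```python
# Hypothetical routing table: data domain -> owning team's alert channel
DOMAIN_OWNERS = {
    "marketing": "marketing-data-team@example.com",
    "finance": "finance-data-team@example.com",
}

def route_alert(domain, message, fallback="data-engineering@example.com"):
    """Deliver the alert to the domain's owners; fall back to the
    central data engineering team only when no owner is registered."""
    recipient = DOMAIN_OWNERS.get(domain, fallback)
    return {"to": recipient, "message": message}

print(route_alert("marketing", "Volume drop in campaign_events"))
# alert goes to the marketing data team, not central data engineering
```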
Deliver alerts with context
It's hard to take action on an alert without some insight into where it came from and how it happened. Simply saying "there is an issue" is not sufficient. Providing additional information in the alert itself is crucial. Once you receive alerts, you should be able to drill down into the data and find the following information:
- Profiling results from the data.
- Data lineage information.
- DQ results over time.
Invest in data quality maturity
Discovering issues is not enough and reactively fixing them is not sustainable for large enterprises. You need to have a mature data quality practice that invests time and effort into stopping these issues proactively. You can do so by investing in:
- Processes that prevent issues from happening in the first place, such as a DQ firewall and built-in validations on input fields.
- Automated data cleansing and standardization. For issues prone to repeating in specific source systems or after ETL processes, set up algorithms that will reliably fix the issues before the data spreads to downstream systems.
- Systematic data governance. Establish organization-wide standards for data quality and metadata management and coach employees on the responsible use of data.
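The DQ-firewall idea above can be sketched as a validation step at the point of entry: a record that fails basic checks is rejected before it reaches downstream systems. Field names and validations here are illustrative assumptions:

```python
import re

def dq_firewall(record):
    """Reject a record at the point of entry if it fails basic
    validations, preventing bad data from spreading downstream.
    Field names are illustrative."""
    errors = []
    if not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", record.get("email", "")):
        errors.append("invalid email")
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    return (len(errors) == 0, errors)

print(dq_firewall({"customer_id": "C-1", "email": "a@example.com"}))  # (True, [])
print(dq_firewall({"email": "not-an-email"}))
# (False, ['invalid email', 'missing customer_id'])
```

In practice, such validations sit in ingestion pipelines or input forms, so issues are stopped where they are cheapest to fix.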
Get started with data observability
Getting good data observability of your systems is not complicated. We have launched an easy-to-configure solution that connects to your systems, discovers the data inside, lets you select the business domains you want to monitor, and notifies you when data quality drops or a potential anomaly is detected.
You can take a product tour or sign up for an early access version with support for Snowflake.