It is tempting to believe that data and the management of its quality are something new, brought about by the advent of regulations such as ePrivacy and the EU GDPR. They are not. Data, its management, and its quality have been around since information was first created: when we started writing things down.
Data Quality Definition
“Data Quality is the planning, implementation, and control of activities that apply quality management techniques to data to ensure it is fit for consumption and meets the needs of data consumers.”
Data Management Body of Knowledge
We could go further and describe data quality as a process: one that makes data operational and enables individuals and organizations to draw insights from the data that inform their decision-making.
The reason we describe DQ as a process rather than a single item is that it comprises various elements that all contribute to the purpose of making data “fit for purpose”. Sometimes people use the term Data Preparation to refer to these elements, though data prep should be considered separate for now.
What are the dimensions of data quality?
Sitting underneath the umbrella term of Data Management, DQ takes a holistic view of an entire dataset, combining these elements – often called the dimensions of Data Quality – to provide a snapshot of the quality of data held.
Are there gaps in the data and if so, where? Some gaps are worse than others and what is considered a gap depends on the process where the data is used. For example, if the billing department requires both phone number and email address, then no record missing one or the other can be considered complete. You can also measure completeness for any particular column. Profiling your data will uncover these gaps.
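As a rough illustration of measuring completeness, here is a minimal Python sketch. The records, the field names, and the rule that billing needs both phone and email are assumptions made up for this example, not output from any particular tool.

```python
from typing import Optional

# Hypothetical customer records; None or "" marks a missing value.
records = [
    {"id": 1, "phone": "555-0100", "email": "ann@example.com"},
    {"id": 2, "phone": None, "email": "bob@example.com"},
    {"id": 3, "phone": "555-0102", "email": ""},
    {"id": 4, "phone": "555-0103", "email": "dee@example.com"},
]

def is_present(value: Optional[str]) -> bool:
    """A value counts as present if it is not None and not blank."""
    return value is not None and value.strip() != ""

def column_completeness(rows, column):
    """Share of rows with a non-missing value in one column."""
    return sum(is_present(r.get(column)) for r in rows) / len(rows)

def record_completeness(rows, required):
    """Share of rows in which every required column is present."""
    complete = sum(all(is_present(r.get(c)) for c in required) for r in rows)
    return complete / len(rows)

# Column-level view: 3 of 4 rows have a phone number.
phone_score = column_completeness(records, "phone")

# Process-level view: billing needs phone AND email, so only 2 of 4 rows qualify.
billing_score = record_completeness(records, ["phone", "email"])

print(f"phone: {phone_score:.0%}, billing-complete: {billing_score:.0%}")
```

The two scores differ because completeness depends on the process that consumes the data: a column can look mostly complete while far fewer records satisfy a rule that spans several columns.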
Are the postcode records you hold in a valid format? How confident are you that the email and postal addresses you hold in your database are capable of receiving mail? Validity checks verify that the data conforms to a particular format, data type, and range of values.
Since data-driven automation is so important nowadays, data has to be valid to be accepted by processes and systems that expect it.
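A validity check for format, data type, and range might look like the sketch below. The regular expressions are deliberately simplified assumptions (a loose email shape and a UK-style postcode shape), and the 1900 lower bound for birth dates is arbitrary. A format check of this kind cannot tell you whether an address actually receives mail.

```python
import re
from datetime import date

# Simplified, illustrative patterns (assumptions, not complete standards).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
UK_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

def is_valid_email(value: str) -> bool:
    """Format check: something@something.something, no spaces."""
    return bool(EMAIL_RE.match(value))

def is_valid_postcode(value: str) -> bool:
    """Format check against a UK-style postcode shape."""
    return bool(UK_POSTCODE_RE.match(value.strip().upper()))

def is_valid_birth_date(value: str) -> bool:
    """Type check (must parse as an ISO date) plus a range check."""
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False
    return date(1900, 1, 1) <= parsed <= date.today()

for candidate in ["ann@example.com", "ann@example"]:
    print(candidate, is_valid_email(candidate))
```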
Is new information entering your CRM every day in real-time or are you manually importing it? How often is the data “refreshed”? Timeliness is a crucial dimension because of the increasing need for up-to-date data.
Similar to other dimensions, timeliness is user-defined. One kind of data needs to be available on a quarterly basis for financial reporting. Other data must not be older than 5 minutes for real-time analytics.
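Because the threshold is user-defined, a timeliness check can be as small as comparing the age of the data with an allowed maximum. In this sketch the fixed clock, the sample refresh times, and the 92-day stand-in for "quarterly" are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def is_timely(last_refreshed: datetime, max_age: timedelta, now: datetime) -> bool:
    """True if the data was refreshed within the allowed maximum age."""
    return now - last_refreshed <= max_age

# A fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Real-time analytics feed: must be no older than 5 minutes.
stream_ok = is_timely(now - timedelta(minutes=3), timedelta(minutes=5), now)
stream_stale = is_timely(now - timedelta(minutes=12), timedelta(minutes=5), now)

# Financial reporting: refreshed quarterly, approximated here as 92 days.
report_ok = is_timely(now - timedelta(days=40), timedelta(days=92), now)

print(stream_ok, stream_stale, report_ok)
```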
Do you have the same customer recorded twice in your data set or data catalog? Uniqueness measures how much duplicate data there is in a given data set, either within any particular column or as whole records. For example, in the orders table, each order should have just one row. If, on the other hand, you encounter two records with the same order id, you have a duplicate. How did it get there? Someone could have mistyped the order number. This brings us to the next dimension: accuracy.
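The orders example can be sketched in a few lines of Python. The table contents and the `order_id` field name are hypothetical, and the score used here (distinct keys divided by total rows) is only one of several reasonable ways to express uniqueness.

```python
from collections import Counter

# Hypothetical orders table; "A-1002" appears twice.
orders = [
    {"order_id": "A-1001", "amount": 40.0},
    {"order_id": "A-1002", "amount": 15.5},
    {"order_id": "A-1002", "amount": 99.0},
    {"order_id": "A-1003", "amount": 12.0},
]

def duplicate_keys(rows, key):
    """Return each key value that occurs more than once, with its count."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

def uniqueness(rows, key):
    """Distinct key values divided by total rows; 1.0 means no duplicates."""
    return len({r[key] for r in rows}) / len(rows)

dupes = duplicate_keys(orders, "order_id")
score = uniqueness(orders, "order_id")
print(dupes, score)
```

Note that the two rows sharing an order id carry different amounts, so at most one of them can be right. Detecting the duplicate does not tell you which one.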
Perhaps the most important dimension, accuracy refers to the number of errors in the data. In other words, it measures to what extent recorded data represents the truth. Accuracy is tricky because data might be valid, timely, unique, complete, but inaccurate.
100% accuracy is an aspirational goal for many data managers, and once achieved, the principles of data governance can be combined with DQ to ensure the data does not degrade and become inaccurate ever again.
Do you have conflicting information about the same customer in two different systems? That means the data is inconsistent, which might lead to inconsistent reporting and poor customer service.
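A simple consistency check compares the same fields for the same customer across two systems. The two dictionaries below stand in for a CRM and a billing system. Their contents and the shared customer id are assumptions for this sketch, since real systems rarely share a clean key.

```python
# Hypothetical extracts from two systems, keyed by a shared customer id.
crm = {
    "C-1": {"email": "ann@example.com", "city": "Leeds"},
    "C-2": {"email": "bob@example.com", "city": "York"},
}
billing = {
    "C-1": {"email": "ann@example.com", "city": "Leeds"},
    "C-2": {"email": "robert@example.com", "city": "York"},
    "C-3": {"email": "cat@example.com", "city": "Hull"},
}

def find_conflicts(system_a, system_b, fields):
    """List (customer, field, value_a, value_b) where the two systems disagree."""
    conflicts = []
    # Only customers present in both systems can be compared.
    for customer_id in sorted(system_a.keys() & system_b.keys()):
        for field in fields:
            a = system_a[customer_id].get(field)
            b = system_b[customer_id].get(field)
            if a != b:
                conflicts.append((customer_id, field, a, b))
    return conflicts

conflicts = find_conflicts(crm, billing, ["email", "city"])
print(conflicts)
```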
The Importance of Data Quality and its Value
Of course, everyone wants to know "why is data quality important?" However, we believe an even more important dimension to data needs to be discussed here: value.
Our definition of data quality's value is this: what are the business, risk, and financial values assigned to any piece of information? In this manner, data analysts and other practitioners of data management can quickly assign priorities to different data sources or specific data domains when they do data quality projects.
We recommend using a tool to assign literal values to your data such as:
Business - how valuable is, for instance, Employee salary data to marketing? Chances are, it has a much higher business value to the HR department, whereas customer emails are more useful for marketing.
Risk - are you holding Personally Identifiable Information (PII)? This means you could be exposed to the risk of GDPR fines if this data is not adequately protected to ensure the individual’s privacy.
Financial - eCommerce companies are the best example of the financial value of data. Typically, an email address and a credit card number are all that is needed to transact with customers. Profiling this data, keeping it of high quality, and reporting on it over time can therefore help eCommerce businesses understand the average value of a customer and of an accurate email address.
As you can see from these examples, Data Quality tools can quickly become mission-critical for your business, depending on the quality of the data you need to perform day-to-day operations. So, why is data quality important? Because it adds value.
What are the business costs and risks of poor data quality?
Data quality maturity curves are becoming more prevalent, and organizations can quickly ascertain whether they’re reactive or optimized and governed in their approach to data management.
An example of an organization that is immature in its capture and management of data is one that does not use validation fields or uses free-form capture fields on the contact forms of its website, allowing anyone to enter whatever they like.
Bad data should not be taken lightly as it poses significant risks and business costs. Below are several examples:
- Wasted marketing budget: if your organization is sending physical mail to your customers and marketing leads, but those addresses are out of date or invalid, you’ll be wasting precious marketing dollars and time.
- Non-compliant data: regulations such as the EU General Data Protection Regulation (GDPR) set a standard (Article 5) for maintaining Data Quality in relation to the accuracy and integrity of data. An organization whose data is found to be non-compliant can be fined up to 20 million euros or 4% of annual turnover, whichever is higher!
- Hindered IT modernization projects: when data moves from source to target system, without correct mapping and data quality tools, old dirty data can wreak havoc on the new system.
- Poor customer experience: If contact information is of poor quality, you cannot provide customers with a tailored customer experience and serve them via their preferred channel.
- Fines: In regulated industries such as healthcare and banking, enterprises risk miscalculating key statistics for regulatory reports and getting fined.
- Unreliable analytics and machine learning: Inaccurate or invalid data will provide inaccurate analytics and unreliable machine learning models.
- Strategic operational mistakes: Building a warehouse at the wrong location, not catching fraud, and producing the wrong alloy are all examples of what can happen when bad data is used for business decision-making.
And yes, you can put a number on data quality.
Bad data costs companies 10-30% of their revenue and correcting mistakes in data costs $1-10 per record.
What are the benefits of better data quality?
There are so many benefits to improving the quality of your information that it is impossible to list them all out, but some of the common ones include:
- Increased return on investment for marketing activity thanks to improved email and postal deliverability and more reliable targeting
- Less time spent fixing dirty data. This will save you $1-10 per record.
- Increased ability to personalize your service or product offerings
- Improved, faster decision-making
- Compliance with new and existing regulations and the creation of a consumer-centric data-driven culture
And many more. Ultimately, your business is unique, and therefore how you benefit from improved DQ is also unique.
What are must-have features to ensure data quality?
If you'd like to learn about all the essential capabilities of data quality, you can read the full article here.
Before you do any data quality checks, it’s important to examine your data at its source to better interpret and understand it. Data profiling does this faster and more efficiently than hand-written SQL queries. It helps you define what transformations the data needs and what problems to track in the future.
Data cleansing and transformation
Very often you need to transform data to improve its quality. This includes:
- Format standardization
- Parsing data and breaking it down into separate attributes (e.g., full name into first name and last name)
- Data enrichment: bringing additional data from external sources
- Data deduplication: remove duplicates from data
- Data masking: sometimes you need to obfuscate data for security reasons
It’s important to note that these processes need to happen automatically to any new data before it travels to other systems, reaches data analysts, and is used for business decision-making.
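To make a few of the transformations listed above concrete, here is a minimal Python sketch covering standardization, parsing, deduplication, and masking. Every function name and rule in it is an assumption chosen for illustration. Real-world parsing of names and phone numbers needs far more care than this.

```python
from typing import Tuple

def standardize_phone(raw: str) -> str:
    """Format standardization: keep digits only."""
    return "".join(ch for ch in raw if ch.isdigit())

def split_full_name(full_name: str) -> Tuple[str, str]:
    """Naive parsing: first token is the first name, the rest is the last name."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()

def deduplicate(rows, key):
    """Keep the first row seen for each key value and drop later repeats."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def mask_email(email: str) -> str:
    """Masking: hide all but the first character of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

print(standardize_phone("+1 (555) 010-0100"))
print(split_full_name("Ada Lovelace"))
print(mask_email("ann@example.com"))
```

Running functions like these in an automated pipeline step, rather than by hand, is what lets the same treatment reach every new record.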
That being said, it's even more beneficial and smart to establish processes that validate and “treat data” before it enters any IT system. This is called a data quality firewall. An example of this is an algorithm that checks data entered into a web form against a required format and alerts the user to fix it, such as email addresses or birth dates. But DQ firewalls can be embedded into complex enterprise applications as well.
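A web-form check of this kind could be sketched as follows. The field names, the messages, the simplified email pattern, and the `check_form` function itself are assumptions for illustration. This is not how any specific product implements a DQ firewall.

```python
import re
from datetime import date

# Simplified email shape (an assumption, not a full standard).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_form(form: dict) -> dict:
    """Return a field -> message map; an empty map means the input may pass."""
    errors = {}

    if not EMAIL_RE.match(form.get("email", "")):
        errors["email"] = "Please enter an email address like name@example.com."

    try:
        born = date.fromisoformat(form.get("birth_date", ""))
        if born > date.today():
            errors["birth_date"] = "Birth date cannot be in the future."
    except ValueError:
        errors["birth_date"] = "Please use the format YYYY-MM-DD."

    return errors

# The record is only forwarded to downstream systems when no errors remain.
submission = {"email": "ann@example", "birth_date": "28/02/1990"}
problems = check_form(submission)
print(problems)
```

The point of the design is where the check sits: the user is asked to correct the input before the record reaches any downstream system, so nothing has to be cleaned up later.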
Monitoring and reporting
Peter Drucker said it best: “If you can’t measure it, you can’t improve it.” It’s as valid for data quality as it is for business in general. Tracking changes and improvements to data over time is crucial and is usually done through data quality dashboards.
First, it shows you whether you are moving in the right direction, i.e., whether the data quality metrics that you have defined are improving or not. Second, monitoring data quality helps catch unexpected influxes of bad data and track it to its source. And third, it helps with tracking compliance with regulatory requirements and more.
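As a small sketch of the second point, the function below scans a history of daily quality scores and flags days where the score dropped sharply. The sample scores, the dates, and the 0.05 threshold are made-up assumptions. A real dashboard would track many metrics per data source.

```python
def flag_drops(history, threshold=0.05):
    """Return the dates where the score fell by more than `threshold`
    compared with the previous measurement."""
    alerts = []
    for (_, previous), (day, score) in zip(history, history[1:]):
        if previous - score > threshold:
            alerts.append(day)
    return alerts

# Hypothetical daily validity scores for one data source.
history = [
    ("2024-01-01", 0.97),
    ("2024-01-02", 0.96),
    ("2024-01-03", 0.81),  # sudden influx of bad data
    ("2024-01-04", 0.83),
]

alerts = flag_drops(history)
print(alerts)
```

Knowing the date of a drop gives you a starting point for tracing the bad data back to the load or source system that introduced it.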
If you want to know more, here are some frequently asked questions about data quality.
Can the Data Catalog and Data Quality work together?
Yes! Monitoring your data quality is much more efficient and accessible when integrating it with your data catalog. More specifically, you can automate data quality workflows using the metadata from the data catalog. Here are other ways the data catalog and data quality benefit each other:
- Automating data quality monitoring
- Improving data discovery
- Streamlining on-demand DQ evaluation
- Simplifying data preparation
- Helping discover root causes of quality issues
What is a real-world example of bad data quality affecting analytics?
One of the most common places we find data quality problems is census analysis. Many censuses are taken in both paper and digital formats, leading to quality discrepancies like unreadable inputs and duplicate entries for the same applicant. Most census data undergoes data profiling, standardization, enrichment, matching and consolidation, and relationship discovery before it’s considered fit for analysis.
How to get started with data quality?
Data quality management can seem like a bit of a daunting task. In our opinion, the first steps of any data quality improvement are:
- Determine your current goals and scope (help with a specific business problem dependent on data or focus on a specific critical data element).
- Profile your data.
- Fix the most urgent issues as soon as possible.
- Come up with metrics and methods for measuring data quality.
- Monitor data quality problems.
- Scale your program to other teams, departments, source systems, and critical data elements.
Following this process will ensure you find the relevant strategy for your organization and won’t embark on a task that is overwhelming or inadequate.
How important is data quality for successful AI implementations?
Data quality is essential for successful AI implementations. Spending too much time preparing data is one of the main reasons AI is so expensive and time-consuming. You can ensure more successful AI implementations if you:
- Profile your data
- Perform DQ evaluations
- Have regular DQ monitoring
Otherwise, you’ll be building machine learning models on the wrong data sets, inevitably leading to errors or more work for your AI architects.
Where is Data Quality headed in the future?
Data quality is undoubtedly here to stay, but what kind of innovations can we expect? Well, you can expect the following improvements in the next few years:
- Further automation will enable greater adoption of new architectures like the data fabric and data mesh.
- The term is growing to encompass other aspects of data management like reference and master data management.
- Data will be deliverable to any user at the company, regardless of skill set.
- Data quality tools are becoming singular solutions instead of fragmented features that can cause conflict.
- More systems than people are consuming data.
- Much more!
If you’d like to learn more about the future of data quality and how we got here, you can find it all here.
Improve DQ with Ataccama
An important first step is to profile your data to understand just what state it is in. There are several data management tools that you can use to do this, many of which offer free versions.