Data profiling is the first step to any data initiative. It’s a series of checks and analyses undertaken to gain an increased understanding of data.
What exactly is data profiling?
Once you upload your source data, a data profiler generates information about data patterns, numeric statistics, data domains, dependencies, relationships, and anomalies.
Companies can then use this information to evaluate their data sets (or even single columns within a set, through column profiling) and proceed with the data initiative at hand, whether it’s a simple data analysis or something more complex, such as building a data quality program, migrating data, designing or reviewing architecture, or creating a master data model (these use cases are covered in more detail further below).
Anyone can benefit from using a data profiler because it provides essential information about any data set or data source. To better understand this benefit, let’s look at the types of information captured in a data profile.
What information can you get from data profiling?
Some of the critical insights a data profiling task can provide are:
Data Set Overview
This is an overall summary of your data set. The data profile viewer shows the number of records and attributes, the types of data stored there and how many of each type there are, any relationships discovered between attributes, and so on.
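To make the idea of an overview concrete, here is a minimal sketch in plain Python. It assumes the data set is already loaded as a list of dictionaries; the `overview` function and the sample records are hypothetical illustrations, not how any particular profiler works.

```python
from collections import Counter

def overview(rows):
    """Summarize a data set: record count, attribute count, and value types per attribute."""
    # Collect every attribute name that appears in at least one record.
    attributes = sorted({key for row in rows for key in row})
    type_counts = {}
    for attr in attributes:
        # Count the Python type of each value; missing keys show up as NoneType.
        type_counts[attr] = Counter(type(row.get(attr)).__name__ for row in rows)
    return {
        "records": len(rows),
        "attributes": len(attributes),
        "types": type_counts,
    }

# Hypothetical sample data with a mixed-type "age" attribute.
people = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": None},
    {"id": 3, "name": "Cy", "age": "41"},
]
summary = overview(people)
print(summary)
```

Even this tiny summary surfaces something worth a second look: the `age` attribute holds an integer, a missing value, and a string.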
Basic Data Quality Information
Your profiler will also provide vital information about the quality of data in your set. It will determine quality based on things like a set's completeness (how complete each entry is, and whether it contains null or clearly invalid values) and uniqueness (whether or not there are multiple entries for the same data within the set).
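As a rough illustration of these two measures, the sketch below computes them for a single column. The definitions are assumptions chosen for simplicity: completeness is the share of non-missing values, and uniqueness is the share of distinct values among the non-missing ones. Real tools may define both differently, and `column_quality` is a hypothetical helper.

```python
def column_quality(rows, column):
    """Return completeness and uniqueness ratios for one column.

    rows   -- list of dicts, one per record (assumed input shape)
    column -- name of the attribute to profile
    """
    values = [row.get(column) for row in rows]
    # Treat None and blank strings as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    completeness = len(present) / len(values) if values else 0.0
    # Uniqueness here means: distinct values divided by non-missing values.
    uniqueness = len(set(present)) / len(present) if present else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness}

# Hypothetical sample data: one missing email and one duplicate.
customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": "ann@example.com"},
]
email_quality = column_quality(customers, "email")
print(email_quality)
```

Here three of four emails are present, and two of those three are distinct, so the duplicate entry lowers the uniqueness score.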
Data Formats and Patterns
Data quality enthusiasts know that there are a finite number of formats for postcodes, for example, and that they should be alpha-numeric. Profilers can visualize the different formats and patterns so that you can understand how many values are off.
Profilers generate information about duplicate values within a data attribute, showing you the most common or distinct values.
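One simple way to think about format and frequency analysis is to reduce every value to a "mask" and count how often each mask and each raw value occurs. The following is a sketch under that assumption; the `mask` helper and the sample postcodes are made up for illustration.

```python
from collections import Counter

def mask(value):
    """Replace letters with 'A' and digits with '9'; keep all other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical postcode column with a few inconsistent entries.
postcodes = ["SW1A 1AA", "EC1A 1BB", "M1 1AE", "12345", "sw1a1aa", "EC1A 1BB"]

# Pattern analysis: how many values share each format?
pattern_counts = Counter(mask(p) for p in postcodes)
# Frequency analysis: which raw values repeat most often?
value_counts = Counter(postcodes)

print(pattern_counts.most_common())
print(value_counts.most_common(1))
```

Rare masks are a natural starting point when looking for values that are "off", while the most common raw values point to duplicates.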
Data Domains or Custom Data Tags
Advanced data profiling tools detect what kind of data is stored in a data set and label it. For example, you will see which attributes contain emails, PII, credit card data, or address information.
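A very simplified version of this kind of labeling can be sketched with regular expressions. The rules, the 80% threshold, and the `detect_domain` function below are all assumptions made for the example; production tools rely on much richer detection logic.

```python
import re

# Hypothetical, deliberately simple detection rules.
DOMAIN_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "credit_card": re.compile(r"^\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"),
}

def detect_domain(values, threshold=0.8):
    """Label a column with the first domain matched by at least
    `threshold` of its non-empty values, or None if nothing fits."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for name, pattern in DOMAIN_RULES.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            return name
    return None

# Hypothetical columns from a customer table.
columns = {
    "contact": ["ann@example.com", "bob@example.org", "", "carol@example.net"],
    "card_no": ["4111 1111 1111 1111", "5500-0000-0000-0004"],
    "notes": ["call back", "n/a"],
}
labels = {name: detect_domain(vals) for name, vals in columns.items()}
print(labels)
```

Using a threshold instead of requiring every value to match lets a column keep its label even when a few entries are dirty.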
Other features include detecting data dependencies, checking data against a specific business rule, or slicing data (e.g., by gender, zip code, city, etc.), and analyzing profiles of those particular slices.
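Checking a business rule per slice might look roughly like this. The rule ("an order amount must be positive"), the field names, and the `pass_rate_by_slice` helper are hypothetical and only meant to show the shape of the idea.

```python
from collections import defaultdict

def pass_rate_by_slice(rows, slice_key, rule):
    """Group rows by slice_key and return the share of rows passing rule in each slice."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for row in rows:
        key = row[slice_key]
        totals[key] += 1
        if rule(row):
            passed[key] += 1
    return {key: passed[key] / totals[key] for key in totals}

# Hypothetical business rule: an order amount must be positive.
def amount_is_positive(row):
    return row["amount"] > 0

# Hypothetical order data, sliced by city.
orders = [
    {"city": "Prague", "amount": 120.0},
    {"city": "Prague", "amount": -5.0},
    {"city": "Boston", "amount": 40.0},
    {"city": "Boston", "amount": 15.5},
]
rates = pass_rate_by_slice(orders, "city", amount_is_positive)
print(rates)
```

Comparing pass rates across slices shows whether a problem is spread evenly or concentrated in one segment of the data.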
Why is Data Profiling Important?
It’s hard to understand whether or not a data set is useful or usable without profiling it first. Whatever the use case might be, using data without fully understanding its contents and quality is at best irresponsible.
Despite this, businesses often overlook data profiling because the service is usually packaged within a more comprehensive data quality platform. However, in many data-specific use cases, the relevance and usefulness of data profiling are striking.
Use Cases for Data Profiling
In all of these use cases, data profiling is the first step to secure vital information about a data set before moving on.
- Starting a data quality or data governance initiative. Data profiling is very often the first step to building a data quality or data governance program. It uncovers various repeating problems in data that lead to data quality issues. It can also help data stewards create a data rule for cleansing and monitoring data and establishing data governance policies.
- Building a master data model. The benefit of data profiling for master data management is twofold:
  - First, it gives an overview of where the data of interest is located, for example, which systems store customer data.
  - Next, it provides information about inconsistencies in formatting and value, which, if not standardized, would make the data matching process longer and more compute-intensive.
- Performing data migration. Before a data migration project, profiling data lets data stewards correct errors and perform data cleaning before the data is transferred.
- Evaluation of data suitability and usability. At some point, everyone works with data. Having a tool that gives you an overview of a data set is useful for anyone from digital marketers to rocket scientists.
- AI and Machine Learning. Data profiling tools are also an important component of preparing data for AI or machine learning.
Data Profiling Tips
Here are several tips for planning and maximizing the efficiency of your data profiling activities:
- Separate priorities from the noise. When strategically profiling a legacy system, you can run into massive walls of erroneous data; the question is whether you should care. You have to decide which data sets are most important, such as those holding critical data elements (CDEs), and need their quality addressed first.
- Be careful about the conclusions you draw from profiling. There are different types of data (reference, transactional, and master), and the type affects both how you should profile and the actions you take afterward. For example, a DQ issue in a transactional data set might affect only that one particular entry, whereas a single error in master data could potentially impact thousands of records.
- Try to narrow down the scope of your profiling as much as possible. If you know that 95% of profit comes from 10% of your sales, you can eliminate large sections of the data you would otherwise need to profile.
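The last tip can be turned into a quick calculation. The sketch below assumes you already have profit totals per segment (the segment names and figures are invented) and that all profits are positive; it picks the smallest set of top segments that covers a target share of total profit.

```python
def top_contributors(profit_by_segment, target_share=0.95):
    """Return the largest segments, in descending order, that together
    reach at least target_share of total profit (assumes positive profits)."""
    total = sum(profit_by_segment.values())
    selected, running = [], 0.0
    ranked = sorted(profit_by_segment.items(), key=lambda kv: kv[1], reverse=True)
    for segment, profit in ranked:
        if running >= target_share * total:
            break
        selected.append(segment)
        running += profit
    return selected

# Hypothetical profit per customer segment.
profit = {"enterprise": 900.0, "mid_market": 60.0, "smb": 30.0, "trial": 10.0}
focus = top_contributors(profit, target_share=0.95)
print(focus)
```

In this made-up example, two of the four segments already account for at least 95% of profit, so profiling could start with the data behind those two.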
Data Profiling Real-World Examples
If you’re still not sure about the importance of data profiling, look at these real-world examples.
Uncovering fraud in a bank
It might sound surprising, but if you know your banking business well, profiling might help you detect fraud. One of Ataccama’s users, analyzing data profiling results for several data sets of banking transactions, found outliers in the frequency distribution of phone numbers.
After looking more closely into a few of them, she uncovered that each of these phone numbers was associated with several clients. Finally, she passed the information to the fraud team, who confirmed several fraudulent transactions and set up measures to prevent this in the future.
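The kind of check described in this example could be approximated as follows. This is an illustrative sketch, not the user's actual analysis: the field names, the sample transactions, and the "more than one client per number" threshold are all assumptions.

```python
from collections import defaultdict

def shared_phone_numbers(transactions, max_clients=1):
    """Return phone numbers linked to more than max_clients distinct clients."""
    clients_by_phone = defaultdict(set)
    for tx in transactions:
        clients_by_phone[tx["phone"]].add(tx["client_id"])
    return {
        phone: sorted(clients)
        for phone, clients in clients_by_phone.items()
        if len(clients) > max_clients
    }

# Hypothetical transaction records.
transactions = [
    {"client_id": "C1", "phone": "555-0100"},
    {"client_id": "C2", "phone": "555-0100"},
    {"client_id": "C3", "phone": "555-0100"},
    {"client_id": "C4", "phone": "555-0199"},
    {"client_id": "C4", "phone": "555-0199"},
]
suspicious = shared_phone_numbers(transactions)
print(suspicious)
```

A number that repeats for the same client is not flagged; only numbers shared across different clients are returned for a closer look.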
Ensuring data usability in the drug development process
Developing new drugs is a data-intensive process. Researchers collect and analyze data on thousands of combinations of compounds and cooperate with external laboratories to speed up the process. This means data is exchanged a lot.
So, when in-house researchers receive data from a cooperating party, they profile it to make sure the formatting is correct, verify the data contents, and check for other potential errors. Data profiling helps researchers work only with reliable, verified data.
Learn more about why data profiling and data management disciplines are important in the pharma industry in this in-depth article.
Data Profiling with Ataccama
As you can see, you don't have to be a data scientist to benefit from data profiling. It is a powerful tool that can be used in various situations by people whose main job is not necessarily analyzing sales data or building predictive models. Check out our data profiling page to learn more.