Essential Features of Data Catalogs

Modern enterprises are data-driven, making effective data management one of the top priorities for companies. A data catalog is an essential part of a data management strategy, and enables users to easily find, understand, and trust their organization’s data.

Why?

Because a data catalog creates value for organizations by establishing an inventory of data and metadata that is useful for both business users and IT professionals as part of data quality, data governance, and data enablement initiatives. 

To bring to life the value that catalogs create, we’ll delve deeper into the features that you should look for when considering technology to help you with your data cataloging. 

Data catalog fundamentals

If you are new to data catalogs, check out our comprehensive blog post on the fundamentals of data catalogs to understand how they work and what benefits they bring to data people.

Data ingestion & discovery

To implement an effective data catalog solution, you need to be able to connect it to all (or at least the majority) of the company’s systems: applications, databases, files, and even external APIs. Ideally, the catalog can connect to relational and NoSQL databases, data lakes, data warehouses, cloud storage, metastores, streams, BI tools, and analytical platforms via connectors.

Good data catalogs contain a number of pre-built adapters. They automatically discover all metadata from systems, such as table names, names of attributes, constraints, etc.
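
As a rough illustration of what such an adapter does, here is a minimal sketch that reads table names, column names, types, and basic constraints from a SQLite database using Python's standard sqlite3 module. The function name discover_metadata and the example customer table are assumptions made for this sketch; a real catalog connector would cover many more systems and metadata types.

```python
import sqlite3

def discover_metadata(conn):
    """Return {table: [column info, ...]} for every user table in a SQLite database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    catalog = {}
    for table in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [
            {
                "name": col[1],
                "type": col[2],
                "not_null": bool(col[3]),
                "primary_key": bool(col[5]),
            }
            for col in columns
        ]
    return catalog

# Hypothetical source system used only for this example
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT NOT NULL, created_at TEXT)"
)
metadata = discover_metadata(conn)
```

The resulting dictionary is the kind of technical metadata a catalog would store and later enrich with business context.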

However, it’s essential that data discovery is not a one-off activity; instead, the data catalog should scan sources continuously to discover new data sets and keep a history of data as well.
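
One simple way to picture continuous discovery is to compare the metadata snapshot from the previous scan with the current one and log the differences. The sketch below assumes each snapshot is a dictionary mapping table names to sets of column names; the names diff_snapshots, record_scan, and history are hypothetical.

```python
from datetime import datetime, timezone

def diff_snapshots(previous, current):
    """Compare two {table: set_of_columns} snapshots and list what changed."""
    changes = []
    for table in sorted(current.keys() - previous.keys()):
        changes.append(("table_added", table))
    for table in sorted(previous.keys() - current.keys()):
        changes.append(("table_removed", table))
    for table in sorted(current.keys() & previous.keys()):
        for column in sorted(current[table] - previous[table]):
            changes.append(("column_added", f"{table}.{column}"))
        for column in sorted(previous[table] - current[table]):
            changes.append(("column_removed", f"{table}.{column}"))
    return changes

history = []  # append-only log of scans, so earlier states stay traceable

def record_scan(previous, current, scanned_at):
    """Diff two snapshots and append the result to the scan history."""
    changes = diff_snapshots(previous, current)
    history.append({"scanned_at": scanned_at, "changes": changes})
    return changes

# Hypothetical snapshots from two consecutive scans
yesterday = {"customer": {"id", "email"}}
today = {"customer": {"id", "email", "phone"}, "orders": {"id"}}
changes = record_scan(yesterday, today, datetime(2024, 1, 2, tzinfo=timezone.utc))
```

Running the scan on a schedule and keeping the log gives both the newly discovered data sets and a history of how the sources evolved.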

Search — to let people find the data

One of the most important features of a data catalog is the search and find functionality. A data catalog should be the “Google” for all of your company data and metadata. It should be smart, and quickly find relevant data for users, even if they don’t know exactly what they are searching for. It should help users discover both new data sets and the most trusted ones with a single click.

Besides simple full-text search with AI suggestions, data catalogs support more granular, advanced queries. For example, “find all ‘customer’ data sets from CRM with more than 10,000 records and overall data quality of 85% or more.”
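
The example query above can be expressed as a structured filter over catalog entries. This is only a sketch: the DatasetEntry fields and the search function are assumptions for illustration, not the query language of any particular catalog.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetEntry:
    name: str
    source: str            # system the data set comes from, e.g. "CRM"
    tags: frozenset        # free-form labels such as "customer"
    record_count: int
    quality_score: float   # overall data quality, 0-100

def search(entries, tag=None, source=None, more_than_records=0, min_quality=0.0):
    """Return entries matching every given criterion."""
    return [
        e for e in entries
        if (tag is None or tag in e.tags)
        and (source is None or e.source == source)
        and e.record_count > more_than_records
        and e.quality_score >= min_quality
    ]

# Hypothetical catalog content
entries = [
    DatasetEntry("customer_master", "CRM", frozenset({"customer"}), 250_000, 91.5),
    DatasetEntry("customer_leads", "CRM", frozenset({"customer"}), 8_000, 95.0),
    DatasetEntry("customer_legacy", "CRM", frozenset({"customer"}), 40_000, 72.0),
    DatasetEntry("customer_invoices", "ERP", frozenset({"customer"}), 90_000, 88.0),
]

# "customer" data sets from CRM with more than 10,000 records and quality of 85% or more
results = search(entries, tag="customer", source="CRM",
                 more_than_records=10_000, min_quality=85.0)
```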

Business glossary

Knowing what tables or fields are in which systems is not enough — you have to be able to link them to business terms in order to explain to end users what specific data means. This is why business glossary functionality is essential.

A business glossary is the “FAQ” of your company, and explains the meaning of the data, e.g. what “Days Past Due” means and how it is calculated. Even seemingly simple terms like “active customer” can be defined inconsistently: is it a customer who took a loan 5 years ago and already repaid it, or is it a customer who actively deposits money each month? Can an employee be an active customer?

A business glossary should be used across the whole data catalog but should also be integrated with external applications such as business intelligence (BI) tools to enhance reports. This is an essential feature, as it will help you decrease the number of questions and amount of back-and-forth in your organization, whether about definitions of business terms used on a regular basis in different departments, the meaning of data in unknown attributes, or how a particular report was filtered.
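
To make the link between business terms and physical fields concrete, here is a minimal sketch of a glossary that stores a definition per term and the columns it describes. The class names, column names, and the wording of the definitions are illustrative assumptions, not official definitions.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    name: str
    definition: str
    linked_columns: set = field(default_factory=set)  # "system.table.column" strings

class Glossary:
    def __init__(self):
        self._terms = {}

    def add(self, term):
        # Terms are looked up case-insensitively
        self._terms[term.name.lower()] = term

    def define(self, name):
        return self._terms[name.lower()].definition

    def terms_for_column(self, column):
        """Business terms that explain a given physical column."""
        return sorted(t.name for t in self._terms.values() if column in t.linked_columns)

glossary = Glossary()
glossary.add(GlossaryTerm(
    "Days Past Due",
    "Illustrative definition: days elapsed since the oldest unpaid due date.",
    {"loans.accounts.dpd"},
))
glossary.add(GlossaryTerm(
    "Active customer",
    "Illustrative definition: customer with at least one deposit in the last 30 days.",
    {"crm.customer.is_active"},
))
terms = glossary.terms_for_column("loans.accounts.dpd")
```

An external tool such as a BI report could call something like terms_for_column to show the agreed definition next to a field.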

Metadata management & templates

Good data catalogs allow you to freely add additional metadata: you can tag your terms with a data category (e.g. sensitive, GDPR, PII-related), track business owners, and record any other important information. They also enable the management of any kind of metadata, not only about data but also about things like reports, APIs, servers, or anything else in your landscape.

Finally, it’s critical that the catalog supports custom metadata attributes that enrich data sets. These could be attributes like “department,” “business owner,” “technical steward,” “certified dataset” or anything else that makes sense for your organization.
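
A minimal sketch of tags, custom attributes, and templates follows. It assumes a template is simply the set of attributes that every asset of a given kind should carry; that interpretation, and the names TEMPLATES and Asset, are assumptions for this example.

```python
# Hypothetical templates: attributes every asset of a given kind must carry
TEMPLATES = {
    "dataset": {"department", "business owner", "technical steward"},
    "report": {"business owner"},
}

class Asset:
    """Any catalogued item: a data set, report, API, server, and so on."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind
        self.tags = set()       # categories such as "PII" or "GDPR"
        self.attributes = {}    # free-form custom metadata

    def missing_attributes(self):
        """Attributes required by the template for this kind but not yet filled in."""
        required = TEMPLATES.get(self.kind, set())
        return sorted(required - set(self.attributes))

orders = Asset("sales.orders", "dataset")
orders.tags.update({"PII", "GDPR"})
orders.attributes["department"] = "Sales"
orders.attributes["certified dataset"] = True
missing = orders.missing_attributes()
```

Checking for missing attributes is one way a catalog could flag assets whose documentation is incomplete.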

Data lineage

Data lineage helps users understand the origin and destination of any data asset in a data catalog, how the data was transformed or enriched on the way to obtaining the final result, how different pieces of data are related to one another, and so on. Data lineage is essential for meeting regulatory requirements for the traceability of calculations and data preparation. As such, it should be considered an essential part of any data catalog solution.
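
Lineage is naturally modelled as a directed graph. The sketch below assumes a dictionary that maps each asset to the inputs it is built from, and walks that graph to find everything upstream (origins) or downstream (destinations) of an asset. The asset names are made up for the example.

```python
from collections import deque

# Hypothetical lineage: each key is produced from the listed inputs
LINEAGE = {
    "report.revenue": ["dwh.fact_sales"],
    "dwh.fact_sales": ["staging.orders", "staging.customers"],
    "staging.orders": ["crm.orders"],
    "staging.customers": ["crm.customers"],
}

def upstream(asset, lineage):
    """All assets the given asset depends on, nearest first (breadth-first)."""
    seen, order, queue = set(), [], deque([asset])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

def downstream(asset, lineage):
    """All assets affected by the given asset, found by reversing the edges."""
    reverse = {}
    for target, sources in lineage.items():
        for source in sources:
            reverse.setdefault(source, []).append(target)
    return upstream(asset, reverse)

origins = upstream("report.revenue", LINEAGE)
impacted = downstream("crm.orders", LINEAGE)
```

The upstream walk answers "where did this number come from?", while the downstream walk supports impact analysis before a source is changed.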

Automation and AI

A lot of the things mentioned above are done manually by users of the data catalog solution. This is usually a time-consuming process, requiring a great deal of effort by company employees, especially when the solution is rolled out. Over time, however, the data tends to become obsolete. Users then stop using the solution because the catalog is incomplete — data is missing or outdated. Imagine going to your catalog to look for the term “marketing consents” and finding out that your colleague Jane is the owner, but no longer works at your company. Or you might find a data set that’s a few years old. You’re unlikely to ever go back to the catalog and you might even start to discourage your coworkers from using it.

This is precisely why you need automation. Here are some activities that can be automated:

  • Scanning source systems for new data; detecting and documenting new data items
  • Automatically profiling data to give users info about what’s inside the data
  • Automatically detecting domains (classifying what kind of data an attribute holds) so that things like GDPR attributes are discovered, kept up to date, and assigned a business owner according to the domain or system the data comes from.
  • Detecting similarities in data, and trying to guess the relationship between data points in different data sources. This also includes detecting duplicate data and allowing users to join or merge data from different source systems.
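
Two of the activities above, profiling and domain detection, can be sketched in a few lines. This example uses plain regular expressions rather than AI, purely to show the idea; the patterns, the 0.8 threshold, and the function names are assumptions and would be far too simplistic for production use.

```python
import re

def profile(values):
    """Basic column profile: row count, share of nulls, number of distinct values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_ratio": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }

# Simplified, illustrative patterns for two domains
DOMAIN_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{6,}\d"),
}

def detect_domain(values, threshold=0.8):
    """Return the first domain whose pattern matches at least `threshold` of non-null values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return None
    for domain, pattern in DOMAIN_PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.fullmatch(v))
        if matches / len(non_null) >= threshold:
            return domain
    return None

# Hypothetical column sampled from a source system
column = ["a@example.com", "b@example.org", None, "c@example.net"]
stats = profile(column)
domain = detect_domain(column)
```

Once a column is recognized as, say, email data, the catalog could tag it as personal data and route it to the responsible owner automatically.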

Learn more about real-world applications of automation for data cataloging and metadata management in this blog post.

Integration with data quality tools

Users may be wary of using data, especially where they’re unsure if they have the right source or if the quality is dubious. The ability to monitor data quality and how it changes over time can be embedded directly in the data catalog, helping users understand if they can trust a particular data set and whether it’s fit for the purpose at hand. 

AI can be used to detect anomalies or sudden changes in data and notify users about such events, allowing errors to be corrected continuously.
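
As a deliberately simple stand-in for the AI-based detection described above, the sketch below flags a new data quality measurement when it deviates strongly from recent history, using a z-score. The threshold of 3 and the sample completeness values are assumptions for illustration.

```python
from statistics import mean, stdev

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical daily completeness scores (percent) for one data set
completeness = [98.1, 97.9, 98.3, 98.0, 98.2]
sudden_drop = is_anomaly(completeness, 91.0)
normal_day = is_anomaly(completeness, 98.1)
```

A catalog could run a check like this after every profiling run and notify the data owner when it fires.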

Social features

The user experience is created through subtle and simple things like the ability to rate a data set, comment on it, share it with coworkers, etc. While simple, these features are key to data catalog adoption.

It’s critical to understand that while just 1% of your company will create and update the content of your catalog, 99% of users will consume it.

The more “likes” a content producer sees, the more value they will see in keeping that content alive. The more likes a user sees, the more they will understand that they’re looking at something useful.

Governance features

Another important component of any data catalog is helping with data governance. Data catalogs need to be able to label sensitive data, set up access permissions, and even set time limits for data storage. Features such as access management and approval workflows can be set up through the catalog to ensure that the right people always handle data appropriately. 
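
The three governance controls mentioned above (sensitivity labels, access permissions, and storage time limits) can be sketched as a small policy object. The Policy fields, role names, and retention period here are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Policy:
    sensitivity: str          # label such as "public" or "pii"
    allowed_roles: frozenset  # roles permitted to access the data
    retention_days: int       # how long the data may be stored

def can_access(policy, role):
    """True if the role is permitted by the policy."""
    return role in policy.allowed_roles

def is_expired(policy, created_on, today):
    """True if the data has been stored longer than the retention period."""
    return today > created_on + timedelta(days=policy.retention_days)

# Hypothetical policy for a data set containing personal data
pii_policy = Policy("pii", frozenset({"data_steward", "compliance"}), retention_days=365)

analyst_allowed = can_access(pii_policy, "analyst")
steward_allowed = can_access(pii_policy, "data_steward")
overdue = is_expired(pii_policy, date(2023, 1, 1), date(2024, 6, 1))
```

In practice, a denied request could trigger an approval workflow rather than a hard rejection.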

Data catalogs can also assist in data stewardship by providing a clear map of data ownership with all relevant stakeholders identified in the catalog. 

Don’t just rely on crowdsourcing. Automation keeps a catalog up-to-date and is a must for the long-term survival of your data governance initiatives. Any tool you select for the job should help you achieve this. If this sounds great to you, learn more about our automated data catalog here.
