Essential Features of Data Catalogs
Modern enterprises are data-driven, making effective data management one of the top priorities for companies. A data catalog is an essential part of a data management strategy, and enables users to easily find, understand, and trust their organization’s data.
A data catalog creates value for organizations by establishing an inventory of data and metadata that is useful for both business users and IT professionals as part of data quality, data governance, and data enablement initiatives.
To bring to life the value that catalogs create, we’ll delve deeper into the features that you should look for when considering technology to help you with your data cataloging.
Data ingestion & discovery
To implement an effective data catalog solution, you need to be able to connect it to all (or at least the majority) of the company systems — applications, databases, files, and even external APIs.
Good data catalogs contain a number of pre-built adapters. They automatically discover all metadata from systems, such as table names, names of attributes, constraints, etc.
However, it’s essential that data discovery is not a one-off activity; instead, the data catalog should scan sources continuously to discover new data sets and keep a history of data as well.
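To make discovery more concrete, here is a minimal sketch of a metadata scan, assuming the source system is a SQLite database reachable through Python's standard sqlite3 module. The function names (scan_metadata, new_tables) and the sample tables are illustrative only; a real catalog adapter would cover many source types and far more metadata.

```python
import sqlite3

def scan_metadata(conn):
    """Return {table: [(column, declared_type, not_null, is_primary_key), ...]}."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    snapshot = {}
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        snapshot[table] = [(c[1], c[2], bool(c[3]), bool(c[5])) for c in cols]
    return snapshot

def new_tables(previous, current):
    """Tables present in the current scan but not in the previous one."""
    return sorted(set(current) - set(previous))

# Hypothetical source system used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
first_scan = scan_metadata(conn)

# Later, a new table appears in the source and the next scan picks it up.
conn.execute("CREATE TABLE loan (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
second_scan = scan_metadata(conn)

history = [first_scan, second_scan]  # keeping every snapshot gives a history
added = new_tables(first_scan, second_scan)
print(added)
```

Running the scan on a schedule and comparing each snapshot with the previous one is one simple way to turn a one-off discovery into a continuous one.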
Search — to let people find the data
One of the most important features of a data catalog is its search functionality. A data catalog should be the “Google” for all of your company data and metadata. It should be smart and quickly find relevant data for users, even if they don’t know exactly what they are searching for. It should help users discover new and highly trusted data sets with a single click.
Besides simple full-text search with AI suggestions, data catalogs support more granular, advanced queries. For example, “find all ‘customer’ data sets from CRM with more than 10,000 records and overall data quality of 85% or more.”
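The example query above can be sketched as a simple filter over catalog entries. This is only an illustration: the DataSet fields, the search function, and the sample entries are assumptions made for the example, not the query language of any particular product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSet:
    name: str
    source: str           # system the data set comes from
    tags: frozenset       # free-form labels such as "customer"
    record_count: int
    quality_score: float  # overall data quality, 0 to 100

def search(catalog, tag, source, min_records, min_quality):
    """Find data sets with the tag, from the source, above both thresholds."""
    return [
        d for d in catalog
        if tag in d.tags
        and d.source == source
        and d.record_count > min_records      # "more than" N records
        and d.quality_score >= min_quality    # quality of X% "or more"
    ]

# Hypothetical catalog content.
catalog = [
    DataSet("crm_customers", "CRM", frozenset({"customer"}), 250_000, 92.0),
    DataSet("crm_leads", "CRM", frozenset({"customer", "lead"}), 8_000, 97.0),
    DataSet("crm_customers_old", "CRM", frozenset({"customer"}), 120_000, 61.5),
    DataSet("erp_customers", "ERP", frozenset({"customer"}), 300_000, 95.0),
]

hits = search(catalog, tag="customer", source="CRM", min_records=10_000, min_quality=85)
print([d.name for d in hits])
```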
Business glossary
Knowing which tables or fields are in which systems is not enough: you have to be able to link them to business terms in order to explain to end users what specific data means. This is why business glossary functionality is essential.
A business glossary is the “FAQ” of your company, and explains the meaning of the data, e.g. what “Days Past Due” means and how it is calculated. Even seemingly simple terms like “active customer” can be defined inconsistently: is it a customer who took a loan 5 years ago and already repaid it, or is it a customer who actively deposits money each month? Can an employee be an active customer?
A business glossary should be used across the whole data catalog but should also be integrated with external applications such as business intelligence (BI) tools to enhance reports. This is an essential feature, as it will help you decrease the number of questions and amount of back-and-forth in your organization, whether about definitions of business terms used on a regular basis in different departments, the meaning of data in unknown attributes, or how a particular report was filtered.
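As a rough sketch of how glossary terms can be linked to physical data, the snippet below keeps a set of linked columns per term and looks up the terms that explain a given column. The class names, the column path, and the wording of the definition are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    name: str
    definition: str
    # Physical columns this term describes, written as "system.table.column".
    linked_columns: set = field(default_factory=set)

class Glossary:
    def __init__(self):
        self._terms = {}

    def add(self, term):
        self._terms[term.name.lower()] = term

    def link(self, term_name, column):
        """Attach a physical column to an existing business term."""
        self._terms[term_name.lower()].linked_columns.add(column)

    def explain(self, column):
        """Return all business terms linked to the given column."""
        return [t for t in self._terms.values() if column in t.linked_columns]

glossary = Glossary()
glossary.add(GlossaryTerm(
    "Days Past Due",
    "Illustrative definition: number of days since a missed payment due date.",
))
glossary.link("days past due", "core_banking.loans.dpd")

explained = glossary.explain("core_banking.loans.dpd")
print([(t.name, t.definition) for t in explained])
```

The same lookup is what an integration with a BI tool would call to show a definition next to a report column.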
Metadata management & templates
Good data catalogs allow you to freely add metadata: you can tag terms with a data category (e.g. sensitive, GDPR, PII-related), record business owners, and capture any other important information. They also let you manage any kind of metadata, not only about data but also about reports, APIs, servers, or anything else in your landscape.
Finally, it’s critical that the catalog supports custom metadata attributes for enriching data sets. These could be attributes like “department,” “business owner,” “technical steward,” “certified dataset” or anything else that makes sense for your organization.
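One way to picture a metadata template is as a list of custom attributes with expected types, checked whenever a data set is documented. The sketch below assumes exactly that; the template contents mirror the attributes named above, and the function name validate_metadata is made up for the example.

```python
# Hypothetical template: custom attribute name -> expected Python type.
DATASET_TEMPLATE = {
    "department": str,
    "business_owner": str,
    "technical_steward": str,
    "certified_dataset": bool,
}

def validate_metadata(metadata, template):
    """Return a list of problems; an empty list means the metadata fits."""
    problems = []
    for attr, expected_type in template.items():
        if attr not in metadata:
            problems.append(f"missing: {attr}")
        elif not isinstance(metadata[attr], expected_type):
            problems.append(f"wrong type: {attr}")
    return problems

complete = {
    "department": "Risk",
    "business_owner": "alice",
    "technical_steward": "bob",
    "certified_dataset": True,
}
incomplete = {"department": "Risk", "certified_dataset": "yes"}

print(validate_metadata(complete, DATASET_TEMPLATE))
print(validate_metadata(incomplete, DATASET_TEMPLATE))
```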
Data lineage
Data lineage helps users understand the origin and destination of any data asset in a data catalog, how the data was transformed or enriched on the way to obtaining the final result, how different pieces of data are related to one another, and so on. Data lineage is essential for meeting regulatory requirements for the traceability of calculations and data preparation. As such, it should be considered an essential part of any data catalog solution.
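Lineage is naturally modeled as a directed graph: assets are nodes and each transformation is an edge. The following is a minimal sketch under that assumption, with hypothetical asset names; it answers the two basic questions of where an asset comes from and where it flows to.

```python
from collections import defaultdict, deque

class LineageGraph:
    def __init__(self):
        self._downstream = defaultdict(set)  # asset -> assets derived from it
        self._upstream = defaultdict(set)    # asset -> assets it is built from
        self.transformations = {}            # (source, target) -> description

    def add_edge(self, source, target, transformation=""):
        self._downstream[source].add(target)
        self._upstream[target].add(source)
        self.transformations[(source, target)] = transformation

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal collecting every reachable asset."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def origins(self, asset):
        return self._walk(asset, self._upstream)

    def destinations(self, asset):
        return self._walk(asset, self._downstream)

# Hypothetical pipeline.
g = LineageGraph()
g.add_edge("crm.customers", "staging.customers", "copy")
g.add_edge("staging.customers", "dw.dim_customer", "deduplicate")
g.add_edge("billing.invoices", "dw.fact_revenue", "aggregate")
g.add_edge("dw.dim_customer", "report.revenue_by_customer", "join")
g.add_edge("dw.fact_revenue", "report.revenue_by_customer", "join")

print(sorted(g.origins("report.revenue_by_customer")))
print(sorted(g.destinations("crm.customers")))
```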
Giving users access to the data itself, directly from the catalog, is a more recent trend in metadata management solutions. Because a data catalog is the central place where users find data, it is only logical that they would also want to access and use the data from there.
If the data catalog tool allows users to download the data set or connect it to their BI tool of preference or other applications, it becomes a kind of marketplace where employees can “buy” or go shopping for company data. One problem is, of course, access policies and data governance.
You can’t just give everyone access to any data.
Data catalog tools can approach this in several ways. One, they can support a workflow process where a user submits a request and gets access to data after being approved. Two, restrictions can be applied automatically based on the data domain and the role of the person in the organization. The third way is a combination of the manual and the automated approaches.
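The combined approach can be sketched in a few lines: automatic rules are checked first, and anything they do not cover falls back to the approval workflow. The rule table, role names, and return values below are assumptions made for this example.

```python
# Hypothetical rule table: (data domain, role) pairs granted automatically.
AUTO_GRANT = {
    ("marketing", "marketing_analyst"),
    ("finance", "controller"),
}

def check_access(user, role, domain, approved_requests):
    """Combine automatic rules with a manual approval workflow."""
    if (domain, role) in AUTO_GRANT:
        return "granted (automatic rule)"
    if (user, domain) in approved_requests:
        return "granted (approved request)"
    return "request required"

# Requests that a data owner has already approved, as (user, domain) pairs.
approved_requests = {("bob", "finance")}

print(check_access("alice", "marketing_analyst", "marketing", approved_requests))
print(check_access("bob", "marketing_analyst", "finance", approved_requests))
print(check_access("carol", "intern", "finance", approved_requests))
```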
Always up-to-date: AI does the manual work for you
A lot of the things mentioned above are done manually by users of the data catalog solution. This is usually a time-consuming process, requiring a great deal of effort by company employees, especially when the solution is rolled out. Over time, however, the data tends to become obsolete. Users then stop using the solution because the catalog is incomplete — data is missing or outdated. Imagine going to your catalog to look for the term “marketing consents” and finding out that your colleague Jane is the owner, but no longer works at your company. Or you might find a data set that’s a few years old. You’re unlikely to ever go back to the catalog and you might even start to discourage your coworkers from using it.
This is precisely why you need automation. Here are some activities that can be automated:
- Scanning source systems for new data; detecting and documenting new data items
- Automatically profiling data to give users info about what’s inside the data
- Automatic domain detection (recognizing what kind of data a column holds) to keep things like GDPR attributes discovered and up-to-date, and to assign a business owner according to the domain or system the data comes from.
- Detecting similarities in data, and trying to guess the relationship between data points in different data sources. This also includes detecting duplicate data and allowing users to join or merge data from different source systems.
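Two of the activities above, profiling and domain detection, can be illustrated with a very small sketch. It uses a plain regular expression to recognize email-like values, which is a deliberately simple stand-in for the AI-based detection a real catalog would use; the function names and the 80% threshold are assumptions.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile_column(values):
    """Basic profile: total count, number of nulls, number of distinct values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }

def detect_domain(values, threshold=0.8):
    """Label the column 'email' if enough non-null values look like emails."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    matches = sum(1 for v in non_null if EMAIL.match(str(v)))
    return "email" if matches / len(non_null) >= threshold else None

contact = ["a@example.com", "b@example.com", None, "a@example.com"]
city = ["Prague", "Toronto"]

print(profile_column(contact), detect_domain(contact))
print(profile_column(city), detect_domain(city))
```

A column labeled this way could then be tagged as personal data and routed to the right business owner without anyone inspecting it by hand.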
Data quality monitoring & anomaly detection
Users may be wary of using data, especially where they’re unsure if they have the right source or if the quality is dubious. The ability to monitor data quality and how it changes over time can be embedded directly in the data catalog, helping users understand if they can trust a particular data set and whether it’s fit for the purpose at hand.
AI can be used to detect anomalies or sudden changes in data and notify users about such events, allowing errors to be corrected continuously.
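As a simplified illustration of anomaly detection, the sketch below flags a new measurement (for example, a daily record count) when it lies far from the historical average. It uses a basic z-score check rather than the AI techniques mentioned above, and the threshold of 3 standard deviations and the sample numbers are assumptions.

```python
from statistics import mean, stdev

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than z_threshold sample standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical values
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical daily record counts for one data set.
daily_counts = [1000, 1010, 990, 1005, 995]

print(is_anomaly(daily_counts, 1003))  # close to the usual level
print(is_anomaly(daily_counts, 400))   # sudden drop
```

The same check can be applied to any tracked quality metric, such as the share of null values in a column, and wired to a notification when it fires.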
A catalog is for every user and user experience must be part of the product strategy
It’s possible to use Excel as a data catalog. However, the key to ensuring long-term use by users is usability. The tool you choose has to have this as part of its DNA.
A catalog is a tool for both business and technical users. The catalog has to be accessible to everyone, with advanced features reserved for data stewards and technical users, such as data engineers.
The user experience is created through subtle and simple things like the ability to rate a data set, comment on it, share it with coworkers, etc. While simple, these features are key to data catalog adoption.
It’s critical to understand that while just 1% of your company will create and update the content of your catalog, 99% of users will consume it.
The more “likes” a content producer sees, the more value they will see in keeping that content alive. The more likes a user sees, the more they will understand that they’re looking at something useful.
Don’t just rely on crowdsourcing. Automation keeps a catalog up-to-date and is a must for the long-term survival of your data governance initiatives. Any tool you select for the job should help you achieve this.