With the number of data sources and the amount of data in enterprises growing, automating metadata management, data management, and data governance tasks becomes a necessity, not just a way to stay ahead of the competition. Not only do data-intensive organizations struggle to understand and effectively operationalize metadata, but they also face more regulatory pressure in regards to data privacy.
That's why we were not surprised when one of our clients approached us to help them automate metadata capture and maintenance, and replace an existing data catalog solution. Read on to learn what metadata management automation means in a real-world scenario and how we implemented it with Ataccama ONE.
High level project overview
The client's data environment for the project looked as follows:
- Data was stored in Hive on two environments: QAT (testing) and PROD (production).
- Metadata about Hive tables was stored in more than 300 Excel files, with each file containing metadata about one or more tables. The structure of the Excel files was not always the same.
As part of this project we had to deliver the following:
- The workflow for updating the data catalog (implemented in Ataccama ONE) when new data assets were created in Hive as part of a CICD pipeline.
- Replicate all metadata residing in an existing data catalog solution in Ataccama ONE (a one-time job).
What we accomplished
On this project, we:
- Successfully imported metadata about all tables from Hive to the Ataccama ONE Data Catalog enriched with additional metadata from Excel spreadsheets.
- Replaced a slow, manual process for importing new metadata assets in a data catalog with an automated one.
- Helped the client stop using Excel spreadsheets for capturing new or updated metadata in Hive.
- Automated several data governance tasks requested by the data governance team.
Automating metadata capture
One of the main deliverables for the project was to establish an automated process for detecting changes in Hive and triggering a data discovery process to import new or updated metadata to the data catalog. Let's look at the original and new processes.
The original metadata capture process
Before implementing Ataccama, the process for populating the data catalog solution with metadata looked as follows:
- A change in Hive occurred (new data was loaded and existing data was edited).
- The data owner filled in the metadata in an Excel spreadsheet.
- The data governance team validated the information in the Excel spreadsheet.
- The spreadsheet was imported into the data catalog solution (QAT environment).
- Metadata assets were manually imported to the PROD environment.
One obvious problem with such a metadata management process is that it is not centralized. Metadata was gathered and managed separately from being provided. This introduced a lot of delays and inconsistencies.
The new process with Ataccama ONE
After capturing the requirements, we designed a process for creating and updating data assets in the Ataccama ONE Data Catalog automatically when new data arrived while preserving the data governance team’s ability to validate and edit metadata.
A data asset is the representation of a source table in the data catalog. A data asset contains metadata about the table: the number of attributes and records, the data profile of the table, the owner of the table, table descriptions, internal classifications, etc.
- A Hive table is developed or updated as part of a CI/CD process.
- The CI/CD process sends a REST API call to the Data Catalog, which scans the new/edited table and adds it to the data catalog along with basic metadata and autodetected business terms.
- A ticket is created in the internal issue tracking system for safe-keeping and internal audit.
- The user assigned to the ticket gets an email notification, checks the new data asset, and updates metadata that is not captured automatically in the QAT environment, for example, by adding a missing description.
- After the ticket is resolved, Ataccama ONE propagates metadata from the QAT version of the data asset to the PROD version.
With the new process, data owners don’t have to manually fill in Excel spreadsheets since all the technical metadata, like column names and data types, is captured automatically. However, data governance is maintained since every new table or update creates a ticket. Additionally, business terms are assigned or suggested based on defined rules or AI algorithms. Finally, data stewards and data owners can create and maintain metadata collaboratively in a single web application, each group having a different set of permissions appropriate for their roles.
Replicating existing data assets in Ataccama ONE
Before launching the new process described above, we replicated data assets in the existing data catalog solution in the Ataccama ONE Data Catalog. The process consisted of three parts:
- Profile all table names found in the Excel spreadsheets and capture technical metadata directly from Hive.
- Import table level metadata (tags, descriptions).
- Import column level metadata (tags, descriptions).
When we examined the Excel files with metadata, we discovered that these files followed one of two structure conventions. Therefore, we configured two different interfaces to parse the spreadsheets and retrieve all table names, table-level metadata, and column-level metadata.
Step 1: Profile all tables found in the Excel sheets
We retrieved all table names in all 300 Excel spreadsheets, found those tables in Hive, and profiled them in the Ataccama ONE Data Catalog. After a table is profiled, a data asset is created in the Catalog.
We used Ataccama plans with REST API calls to achieve this.
In Ataccama, a plan is a series of transformations through which data flows from one source to another. In this case, instead of data, metadata was going through the transformation pipelines.
Step 2: Import table-level metadata
The client’s tables had a few pieces of business metadata, like source system name, classification, subject area, source table name, description, etc. We were able to accommodate all of this metadata by creating custom attributes on the imported tables. Tables in the Ataccama ONE Data Catalog can contain any number of custom attributes depending on client requirements. New attributes are automatically available in REST API.
Beyond custom attributes, the client had a few category-like pieces of metadata, so we imported them as glossary terms.
Glossary terms make up an organization’s business glossaries and are the words and phrases used to tag and categorize data. Business terms are catalog assets on their own and have descriptions, owners, and other attributes, and differ from organization to organization. In the customer data domain, typical terms are customer, revenue, first name, ZIP code, SSN, etc. Once data assets are tagged with appropriate terms, data scientists and analysts can find exactly the data that they need.
Below is a part of the plan that populates the custom attributes Description and Source Field Data Type. We read the data from Excel and populate the correct data attributes in the Catalog via REST API (JSON Call step in the plan below).
Step 3: Import attribute level metadata
It wasn’t only the imported tables that had metadata—their data attributes did, too. These were metadata attributes like description and source data type. Again, we used Ataccama plans and REST API to scan the Excel spreadsheets, retrieve attribute-level metadata, and assign it to the right attributes in the right tables.
The importance of flexibility and automation
Time and time again we are convinced that no two organizations are the same. Each one has its own unique metadata structure, metadata store, and business processes. We met with Excel spreadsheets in this particular project, but next time it might be a database, a REST or SOAP interface, etc. Flexibility is one of the defining characteristics of the Ataccama ONE platform, so we can fulfill most requirements without additional coding. In the example above, we did it just by modifying the input interfaces and configuring custom attributes in the Data Catalog.We used our reusable component library and business workflow engine to create a tailored solution that fit the client’s business processes. Some of the tasks we automated were:
- Profiling tables on demand
- Generating DDL based on the data asset metadata and storing it inside the data asset
- Propagating metadata between data assets
- Adding an extra business term based on data asset composition
- Sending a notification to the table owner in cases where a PII data domain is detected but the data sits in a non-encrypted database
You may have noticed that the “plans” used to propagate metadata from external sources to data catalog assets are strikingly similar to some ETL solutions. This is not a coincidence. We used the flexibility of the existing data processing engine and treated stores of metadata and APIs as sources and targets for transformations. Since modern data processing solutions usually work quite well with data in JSON and XML formats popular in REST APIs, we were able to deliver a flexible, tailored solution for metadata automation without writing a single line of code.
Automation in metadata management is the new norm
Metadata management is no easy feat, therefore both processes and technology are important. Enterprise clients increasingly require easy-to-use tools, more and more automation, and enough flexibility without extra custom development and cost. On this project, we were happy to integrate with the client’s existing infrastructure, add the needed custom attributes in the data catalog, automate their business processes, and get rid of Excel.