
Enterprise Data Quality Fabric: A Primer


Since Gartner recognized it in 2019 as one of the emerging data and analytics technology trends, interest in Data Fabric has continued unabated: according to research from KBV, the global Data Fabric market is expected to reach nearly $2bn in 2022, representing growth of over 20%.

We appreciate that Data Fabric can seem complex and intimidating, so we’re going to demystify it in an accessible way. We also want to take the opportunity to present the concept of Enterprise Data Quality Fabric as an augmented version of the “standard” Data Fabric. By the time you’ve finished reading this post, you’ll understand what a data fabric is, what it does, and how it is relevant to you. You will also understand why data quality should be part of your data fabric.

What is a data fabric?

A data fabric is a data architecture design that automates data integration and the delivery of data to users and machines.

In practice, this means that when a user or an algorithm requests data, the data fabric will (a code sketch of this flow follows the list):

  1. Pull the most relevant and valid data from the most relevant data sources.
  2. Integrate and prepare it if necessary.
  3. Process it in the most efficient way (in-place or pull to a central processing engine).
  4. Deliver that data in the requested format (a file, a datamart, a web service/API), frequency, and quality.
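
To make this flow concrete, here is a minimal, purely illustrative Python sketch of the request cycle; Source, serve_request, and the scoring logic are hypothetical stand-ins for what a real fabric automates, not any vendor's API.

```python
# Illustrative sketch only: hypothetical names, not a real fabric API.
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    domains: set          # business domains the source covers, e.g. {"customer"}
    quality_score: float  # 0.0-1.0, maintained by the fabric's metadata layer

def serve_request(domain: str, fmt: str, sources: list) -> dict:
    # 1. Pull the most relevant and valid data from the most relevant sources.
    candidates = sorted(
        (s for s in sources if domain in s.domains),
        key=lambda s: s.quality_score,
        reverse=True,
    )
    # 2. Integrate and prepare the data if more than one source qualifies.
    chosen = candidates[:2]
    # 3. Process in the most efficient way: in place for one source, centrally for several.
    strategy = "in-place" if len(chosen) == 1 else "central"
    # 4. Deliver in the requested format.
    return {"sources": [s.name for s in chosen], "strategy": strategy, "format": fmt}

sources = [
    Source("crm_db", {"customer"}, 0.92),
    Source("web_logs", {"customer", "clickstream"}, 0.71),
]
print(serve_request("customer", "parquet", sources))
```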

If you are a Gartner subscriber, read the Demystifying the Data Fabric paper for an in-depth exploration of the concept and discussion on the key components of a data fabric.

Introducing Enterprise Data Quality Fabric

Here at Ataccama, we believe it’s impossible to talk about the delivery of data without talking about data quality. That’s why we are introducing the concept of Enterprise Data Quality Fabric. It builds on standard data fabric principles but adds the following capabilities to ensure that the data delivered to the end consumer is reliable, valid, and fit for purpose:

  • Embedded data profiling and classification
  • Embedded and automated data quality management: assessment, monitoring, standardization, cleansing, enrichment, and issue resolution
  • Anomaly detection based on self-learning AI/ML models
  • Master data management
  • Reference data management

Enterprise Data Quality Fabric defined

Here is the definition of the Enterprise Data Quality Fabric:

A modern way to deliver quality data to the relevant teams and algorithms that need it, whenever and however they need it, with governance, quality, and compliance ensured automatically.

If data governance is a combination of people, processes, and data, Data Quality Fabric is the combination of data and technology to automate data governance and simplify many data-dependent processes, such as data science or data engineering.

How is the Enterprise Data Quality Fabric relevant to you?

The AI-powered, automated data integration that a data fabric provides is relevant to data management professionals such as analysts, engineers, scientists, and stewards because it enables them to:

  • Easily integrate large volumes of data, quickly
  • Automate the metadata collection process and keep metadata in sync
  • Create an organization-wide single view of data
  • Improve security and risk management through automated governance

Benefits like these impact most, if not all, individuals within a business, but they are especially relevant to those who are directly responsible for data or whose objectives and goals rely heavily on it.

How does the Enterprise Data Quality Fabric work?

Universal connectivity

The fabric connects and integrates data from all sources that you consider important and that contain relevant and useful data. This means your data lake, ERP, file server, and data warehouse all provide data to the fabric, which ingests, processes, and integrates it automatically.
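
As a rough illustration of what universal connectivity amounts to, the sketch below registers a few heterogeneous sources in a hypothetical connector registry and loops over them for ingestion; the connector types, URIs, and ingest_all function are invented for the example.

```python
# Hypothetical connector registry: illustrates the idea, not a product API.
connectors = {
    "data_lake":   {"type": "s3",        "uri": "s3://corp-lake/raw/"},
    "erp":         {"type": "jdbc",      "uri": "jdbc:oracle:thin:@erp-host:1521/ERP"},
    "file_server": {"type": "smb",       "uri": "smb://files.corp.local/exports/"},
    "warehouse":   {"type": "snowflake", "uri": "corp.snowflakecomputing.com"},
}

def ingest_all(registry: dict) -> None:
    # The fabric would iterate over every registered source,
    # pull new data and metadata, and hand both to the catalog.
    for name, cfg in registry.items():
        print(f"ingesting {name} via {cfg['type']} connector from {cfg['uri']}")

ingest_all(connectors)
```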

[Diagram: DQ Fabric]

Augmented data catalog

A data catalog is at the center of any data fabric. Why? Because it collects all metadata in one place and makes that metadata available for automation; metadata is the data fabric’s main source of automation. The catalog also lets users look at data from a semantic perspective, through business terms rather than physical tables and columns. An important feature of such a catalog is its ability to self-maintain the information stored in it and to infer additional metadata from what it already knows, which allows the fabric to use the most contextually appropriate metadata.
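
A toy example of that self-maintenance, assuming a catalog keyed by fully qualified column names: collected metadata comes from the source system, while inferred metadata (here a "country" domain detected against a small reference list) is added by the catalog itself. All names and the inference rule are illustrative.

```python
# Toy catalog entry: separates collected metadata from metadata the catalog infers.
REFERENCE_COUNTRIES = {"CZ", "DE", "US", "GB"}   # stand-in for reference data

catalog = {
    "crm_db.contacts.country": {
        "collected": {"type": "varchar(2)", "source": "crm_db", "table": "contacts"},
        "inferred": {},   # maintained by the catalog, not by the source system
    }
}

def infer_domain(column_key: str, sample_values: list) -> None:
    # If every sampled value appears in the reference list, tag the column's domain.
    if sample_values and all(v in REFERENCE_COUNTRIES for v in sample_values):
        catalog[column_key]["inferred"]["domain"] = "country"

infer_domain("crm_db.contacts.country", ["CZ", "DE", "US"])
print(catalog["crm_db.contacts.country"]["inferred"])   # {'domain': 'country'}
```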

[Diagram: Metadata layer]

Interconnected data management

Supercharging the Enterprise Data Quality Fabric are a shared metadata repository, an AI core, and a central orchestration unit. The AI core enriches the knowledge graph in the metadata repository by inferring additional information about the data from all available sources. Operating on top of a fully realized knowledge graph, the recommendation engine can then suggest relevant data and metadata assets as well as the most efficient way to obtain them.

For example, the information obtained from data profiling contributes to the creation of data quality rules, thus adding inferred information to the knowledge graph. Subsequently, the results of data cleansing and standardization can be used to master data and create golden records—without configuring these separately for a master data management project.
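
The sketch below mimics that chain under heavy simplification: profiling statistics suggest a rule, and standardized values then drive matching into golden records. The profile, suggest_rules, and master functions are illustrative, not Ataccama configuration.

```python
# Sketch of the feedback loop described above; all names are made up.
def profile(values):
    non_empty = [v.strip() for v in values if v and v.strip()]
    return {"fill_rate": len(non_empty) / len(values)}

def suggest_rules(column, stats):
    # Profiling results become inferred knowledge: here, a completeness rule.
    return [f"{column} must not be empty"] if stats["fill_rate"] > 0.9 else []

def master(records):
    # Cleansed, standardized values can then drive matching into golden records.
    golden = {}
    for rec in records:
        key = rec["email"].strip().lower()   # standardization step
        golden.setdefault(key, rec)          # toy survivorship: first record wins
    return list(golden.values())

emails = ["Ana@Example.com ", "ana@example.com", "bo@example.org"]
print(suggest_rules("email", profile(emails)))
print(master([{"email": e, "name": n} for e, n in zip(emails, ["Ana", "Ana B.", "Bo"])]))
```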

Ultimately, when a user asks the Fabric for customer data, it will:

  1. Analyze all the available metadata:
    • business definition (what "customer" means)
    • overall data quality
    • data domain information
    • golden record certification
    • relationships
    • data lineage information
  2. Provide the most relevant and valid datasets to her in the most relevant format.

Then, as part of a prototyping exercise, she can easily join and transform these datasets and store the metadata about the resulting dataset, including its lineage, in the data catalog. Alternatively, she can export the data and send it to the consuming process while keeping this information in the data catalog for future use.
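
A minimal sketch of that final step, with invented dataset metadata: candidates are ranked by golden-record certification and data quality score, and the dataset derived from the winner is registered back with its lineage.

```python
# Hypothetical ranking of candidate datasets by their catalog metadata.
datasets = [
    {"name": "dwh.dim_customer",   "dq_score": 0.97, "golden": True,  "lineage": ["crm", "erp"]},
    {"name": "lake.raw_customers", "dq_score": 0.62, "golden": False, "lineage": ["web"]},
]

def rank(candidates):
    # Golden-record certification and data quality drive the recommendation.
    return sorted(candidates, key=lambda d: (d["golden"], d["dq_score"]), reverse=True)

best = rank(datasets)[0]
print("recommended:", best["name"])

# A derived dataset keeps its lineage when it is registered back in the catalog.
derived = {"name": "sandbox.customer_prototype",
           "lineage": [best["name"]] + best["lineage"]}
print("registered lineage:", derived["lineage"])
```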

Flexible delivery

Today’s data-driven enterprise environments require data delivery to humans and algorithms alike, which means different modes and formats. The Fabric delivers data in batch, real-time, and streaming modes and recommends the most appropriate mode for each application based on the use case description.

It makes it easy to read from multiple sources and write data to multiple targets with either auto-generated or user-maintained transformation logic. The same plug-and-play approach lets users configure a data processing pipeline once and run it in any of the above-mentioned modes, with no complex configuration needed.
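
Here is one way to picture "configure once, run in any mode" as a toy Python pipeline; the validate and transform steps and the run driver are hypothetical, but the point is that the same step list can be fed a batch or a stream.

```python
# Toy "configure once, run in any mode" pipeline; not a product API.
def validate(record):
    # Drop records without an id; keep the rest unchanged.
    return record if record.get("id") else None

def transform(record):
    return {**record, "name": record["name"].title()}

PIPELINE = [validate, transform]   # the logic is defined exactly once

def run(records, pipeline):
    # The same steps accept a batch list, a micro-batch, or a stream iterator.
    for rec in records:
        for step in pipeline:
            rec = step(rec)
            if rec is None:
                break
        else:
            yield rec

batch = [{"id": 1, "name": "ana k"}, {"id": None, "name": "dropped"}]
print(list(run(batch, PIPELINE)))        # batch mode
print(next(run(iter(batch), PIPELINE)))  # the same pipeline over a stream iterator
```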

[Diagram: Reusable data pipeline]

Built-in policies

The Fabric automatically protects the data that it accesses on behalf of the actors using it, for example by masking or redacting data and metadata. As with data quality, automation is achieved by applying policies to data indirectly, based on business term classification, i.e., via the metadata layer.
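
A simplified sketch of policy enforcement driven by business-term classification rather than by physical column names; the POLICIES table, the masking logic, and the classification input are all invented for illustration.

```python
# Illustrative policy enforcement keyed on business-term classification.
POLICIES = {"email": "mask", "national_id": "redact"}   # business term -> action

def apply_policies(row, classification):
    # classification maps column names to business terms taken from the catalog.
    protected = dict(row)
    for column, term in classification.items():
        action = POLICIES.get(term)
        if action == "mask":
            protected[column] = protected[column][:2] + "***"
        elif action == "redact":
            protected[column] = None
    return protected

row = {"email": "ana@example.com", "city": "Prague"}
print(apply_policies(row, {"email": "email"}))   # {'email': 'an***', 'city': 'Prague'}
```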

Automated data quality

As mentioned above, we believe data quality should be part of any data fabric. This will enable the delivery of data that is valid, reliable, and fit for purpose, regardless of the use case.

Delivering high-quality data consistently requires a certain degree of automation, and here is how we achieve it:

  • Active metadata: use the available metadata to suggest and apply data quality rules. For example, tie a rule to a data domain, such as email. Whenever email data goes through the Fabric, it is automatically validated and cleansed (see the sketch after this list).
  • AI: Without configuring anything, data is checked for anomalous changes in its characteristics. By the same token, email data can be detected in any data source, which enables the automation mentioned in the previous bullet.
  • Configure once, re-use everywhere: You don't have to configure the same rule again for different data sources. Whether your data lives in an Excel file, a Snowflake data warehouse, Amazon Redshift, or Google BigQuery, the same rule configuration is used for all of them. Ataccama processing is data source agnostic.
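
The sketch below illustrates the first bullet under stated assumptions: a single rule bound to the "email" domain is applied to any column that classification has tagged with that domain, regardless of the underlying source. The RULES table and check function are hypothetical.

```python
# Hypothetical domain-bound rule: configured once, applied wherever "email" is detected.
import re

RULES = {"email": lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or ""))}

def check(dataset, detected_domains):
    # detected_domains comes from AI-based classification, e.g. {"contact": "email"}.
    issues = []
    for row in dataset:
        for column, domain in detected_domains.items():
            rule = RULES.get(domain)
            if rule and not rule(row.get(column)):
                issues.append((row, column))
    return issues

data = [{"contact": "ana@example.com"}, {"contact": "not-an-email"}]
print(check(data, {"contact": "email"}))   # flags only the second row
```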

By incorporating these features, the Fabric automates data profiling/discovery and classification (business term/domain detection), anomaly detection, data quality monitoring, cleansing, standardization, enrichment, and master data consolidation. These features and principles also make configuration much faster and easier and facilitate rule sharing across the organization.

Including data quality in your data fabric design increases data accuracy by up to 40% and speeds up data processing by up to 60% for some use cases.

Source: our measurements on a complex MDM implementation

Platform-agnostic distributed data processing

Data processing speed matters.

That’s why processing data where it resides is so important, among other benefits. We call this edge computing, and here is how it works.

The Fabric can have processing engines installed at any number of locations in the cloud or on premises. While the data is processed locally, the metadata (information about the results of processing) is sent to a central location: the data catalog.

Data source agnostic edge processing lets you quickly integrate data from any kind of source on the fly. For example, you can merge data from a Hadoop file, an RDBMS table, and an Excel spreadsheet without worrying about connectivity.
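
Conceptually, edge processing can be pictured like this: the records never leave their location, and only the resulting metadata is appended to the shared catalog. The process_at_edge function and central_catalog list are, of course, just stand-ins for the example.

```python
# Toy model of edge processing: data stays local, only metadata travels.
central_catalog = []   # stands in for the shared metadata repository

def process_at_edge(location, records):
    # Processing (here just a validity count) happens where the data lives...
    valid = sum(1 for r in records if r.get("id") is not None)
    # ...and only the result metadata is shipped to the central catalog.
    central_catalog.append({"location": location, "rows": len(records), "valid": valid})

process_at_edge("on-prem-hadoop", [{"id": 1}, {"id": None}])
process_at_edge("aws-cloud", [{"id": 7}])
print(central_catalog)
```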

[Diagram: Edge computing]

Building your own Data Quality Fabric

Building a data quality fabric at your organization is not easy, but we're here to help. If you're interested in learning more about how you could benefit from an integrated layer of data and connecting processes, we'd love for you to get in touch with us and schedule a demo.