56,000+ databases and 5,000+ applications with over 8 PB of data to scan
Enterprise-wide PII Protection
Automated data discovery and cataloging of sensitive data
Hundreds of data sources, including:
The telecommunications provider is one of the largest companies in North America and holds massive amounts of data, especially sensitive information about its customers, partners, and other stakeholders.
Because the organization has been growing substantially in recent years, either organically or through mergers and acquisitions, it is at a point where it has to handle and secure more data than ever.
In 2021, the company realized it needed an automated solution for identifying and protecting the PII it stores.
The company had a clear purpose in mind: to ensure all private data in their ecosystem was secured. The number one priority was implementing privacy at scale, followed by deploying automated data quality as a by-product of leveraging their distributed environment and newly implemented data quality management solution.
The data governance team came up with the “Data Scanning at Scale” initiative, which meant scanning an estimated 5,000+ apps and 56,000+ databases. All in all, an estimated 8 petabytes of structured and unstructured data in on-prem and cloud environments would have to be parsed within an 18-month window.
By implementing this framework, they could make sure sensitive information was always secure and protect the communities they serve.
Implementing privacy at scale – scanning and classifying 56000 databases
The sheer volume of data and the number of data sources to scan created challenges that required not only technology but also relationship and people management.
Because of their federated distributed environment, they had to work with multiple IT, engineering, and shadow IT organizations inside the company, which meant getting access to and scanning a large number of various data sources. Based on previous implementations and tools used, these were some of the most common issues they were prepared to face:
- Scans that fail mid-run and never complete
- Delays in onboarding data would leave scan teams waiting
- Numerous rescans due to lack of thorough data quality remediation processes
- Omitting production-level tables from completed scans
Another challenge was directly connected to their complex approach: identifying a tooling solution to onboard different types of data sources and automate the data discovery and cataloging process.
Finding the right solution for an ever-growing infrastructure
Our client is one of the world’s biggest telecommunications companies and deals daily with large amounts of legacy and newly added data. The requirements were clear and called for a robust tool that would have the following:
- Built-in data patterns
- Regular expressions
- String matches
- Pre-configured business glossary
- Proximity sensing
- Correctness algorithms
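To make three of these requirements concrete, here is a minimal Python sketch of how a regular expression, a correctness algorithm, and a proximity check can work together to detect one kind of sensitive value, a payment card number. It is an illustration only and does not represent how Ataccama implements these features. The pattern, the keyword list, the 30-character window, and the function names are all assumptions made for this example; the checksum is the standard Luhn algorithm.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

# Assumed keywords for the proximity check.
CARD_KEYWORDS = ("card", "visa", "mastercard", "credit")


def luhn_valid(number: str) -> bool:
    """Correctness check: return True if the digits pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str, window: int = 30) -> list:
    """Return Luhn-valid card-like matches, noting whether a keyword is nearby."""
    hits = []
    lowered = text.lower()
    for m in CARD_RE.finditer(text):
        candidate = m.group()
        if not luhn_valid(candidate):
            continue  # looks like a card number but fails the checksum
        # Proximity check: look for a keyword within `window` characters.
        context = lowered[max(0, m.start() - window): m.end() + window]
        near = any(k in context for k in CARD_KEYWORDS)
        hits.append({"match": candidate, "keyword_nearby": near})
    return hits
```

In this sketch the regular expression finds candidates, the checksum discards digit runs that cannot be real card numbers, and the keyword flag gives a reviewer an extra signal for ranking a hit.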
After analyzing multiple vendors, the company decided to run a 24-hour proof-of-concept project. The purpose was to challenge the shortlisted vendors to scan multiple environments, such as Oracle, Azure, Snowflake, and AWS, and successfully parse as much data as possible. Ataccama managed to scan 138,972 tables and apply custom extraction rules to narrow the list to 22,494 tables of sensitive data.
Ataccama was ultimately selected for its scanning and classification capabilities, total cost of ownership, integration flexibility, and future-proofing potential, among many other criteria.
“Looking at all the vendors, a lot of them had some great solutions, but holistically with all the things we were looking for in a tool, Ataccama was the most mature solution.”
Data Scanning at Scale Project Leader
Successful implementation of the tool would translate into:
- Scanning multiple systems at the same time
- Detecting unknown systems on any network that need to be scanned
- Allowing data stewards to create new data classifiers/labels/tags to automatically and effectively discover data
- Using accepted or rejected status to improve success in future scans
- Integrating with other systems and feeding results or issues
- Discovering unknown applications & data stores and feeding the information into routine scans
Implementing a complex framework via metadata-driven, AI-powered automation
The “Scanning at Scale” approach revolves around specific steps that have to be followed in order to catalog and remediate data. Ataccama has a significant impact on almost the entire process, thanks to the comprehensive nature of the ONE Gen2 platform. Here is an overview of each step and the role Ataccama plays:
Working with various organizational stakeholders to easily connect the Ataccama engine to any data source or network of sources. Because of the tool’s integration capabilities, our team made sure there were no delays in connecting to any data source.
Connect the data source to the data catalog, test the connection, and flag the source as “ready for scanning.” Within the business glossary, the data governance team could deploy and test the built-in scanning and post-processing rules while also developing additional custom DQ rules.
During the scanning process, the engine that was previously connected to the data source starts to profile samples of data. It analyzes statistics and applies any classification rules necessary. The results are saved in the catalog for reporting.
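The sample-based profiling described in this step can be sketched in a few lines of Python. This is a simplified illustration and not the Ataccama engine: the rule set, the 100-value sample size, the 0.8 match threshold, and the function name `profile_column` are all hypothetical choices made for the example.

```python
import random
import re

# Hypothetical classification rules: label -> pattern a value must fully match.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
}


def profile_column(values, sample_size=100, threshold=0.8, seed=0):
    """Profile a sample of one column and attach labels whose match ratio
    reaches the threshold. Returns basic statistics plus the labels."""
    non_null = [v for v in values if v not in (None, "")]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    if len(non_null) <= sample_size:
        sample = non_null
    else:
        sample = rng.sample(non_null, sample_size)

    stats = {
        "rows": len(values),
        "non_null": len(non_null),
        "sampled": len(sample),
        "labels": {},
    }
    if not sample:
        return stats  # nothing to classify

    for label, pattern in RULES.items():
        matched = sum(1 for v in sample if pattern.match(str(v).strip()))
        ratio = matched / len(sample)
        if ratio >= threshold:
            stats["labels"][label] = round(ratio, 2)
    return stats
```

Working on a sample rather than the full column is what keeps a scan of this size tractable. The threshold tolerates a few malformed values without dropping the label.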
The engine then analyzes classification rules, identifies sensitive data assets, and imports only those assets to the catalog. Automated post-processing is applied to identify false positives and flag tables for review. Lastly, Ataccama’s strong AI fingerprinting can find patterns in existing data and detect similar data assets in future scans.
The catalog also serves as a library where all results can be managed and reviewed. Further, the data governance team can initiate any necessary actions like searching for related data assets, updating metadata information, and setting up further data validation.
During the validation process, users have to review flagged sensitive tables inside the catalog and set up business rules on domains that would be needed for the remediation process.
7. Completion & Final Reporting
Actions like exporting data for various reports, transforming output for reporting tools (PowerBI), and generating executive-level progress reports need to be made easy.
24x7 operations for scanning and remediation of data
The project is still ongoing, and the data governance team is tackling the steps mentioned above with the help of Ataccama ONE Gen2. They are on track with the number of remediated systems, databases marked for remediation, and tables or views that have been remediated. The organization is on the fast track to securing private data at scale and providing data quality as a service to all parts of the company.
“With Ataccama, we are able to do reviews of triple the number of databases in half the time.”
Data Scanning at Scale Project Leader
These are just a few of the results achieved so far:
- Ramped up to 24x7 operations for scanning and remediation as of July 2022.
- Achieved the goal of scanning 100 databases daily with more than 260 tables a minute for an output of 3000 databases a month.
- Scanned 15,000 databases to date.
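As a quick sanity check, the throughput figures above can be set against the overall scope of 56,000+ databases. The short Python calculation below uses only numbers quoted in this case study, plus one assumption: a 30-day month. Because the database count is given as "56,000+", the results are rough lower bounds and not a project forecast.

```python
# Figures quoted in the case study.
DATABASES_TOTAL = 56_000      # stated as "56,000+", so a lower bound
DATABASES_PER_DAY = 100
SCANNED_TO_DATE = 15_000

# Assumption for this estimate: a 30-day month of 24x7 operations.
DAYS_PER_MONTH = 30

per_month = DATABASES_PER_DAY * DAYS_PER_MONTH
months_for_all = DATABASES_TOTAL / per_month
months_remaining = (DATABASES_TOTAL - SCANNED_TO_DATE) / per_month

print(f"Databases per month: {per_month}")
print(f"Months to scan all databases at this rate: {months_for_all:.1f}")
print(f"Months remaining after 15,000 scanned: {months_remaining:.1f}")
```

At 100 databases a day, the monthly output is 3,000 databases, and scanning all 56,000 takes a little under 19 months, close to the 18-month window set for the initiative.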
Automated data protection and data quality solution
Besides the immediate goal of scanning all existing databases and apps, the data governance team is looking to:
- Proactively scan all incoming data and identify new PII.
- Create and build a comprehensive data asset inventory and protect all customer data.
- Identify data redundancy opportunities to eliminate duplicate data environments, processes, and teams.
- Create a centralized Data Marketplace to deliver unified data sets, reports, and analytics.
- Implement automated data quality with several million data quality rules in production and develop a data quality as a service initiative, where every unit inside the organization can track data and make sure it is secured.