The image above was drawn by an AI when we tasked it with creating artwork for the words "Anomaly Detection."
17,000 British men reported being pregnant between 2009 and 2010, according to the Washington Post. These British men sought care for pregnancy-related issues like obstetric checkups and maternity care services. However, this wasn't due to a breakthrough in modern medicine! Someone entered the medical codes into the country's healthcare system incorrectly. Simply put, the data was recorded poorly, and there was no quality check to spot the error!
It's hard to blame Britain's healthcare services. Quality, by definition, is subjective. Creating quality checks for every possible variety of incorrect data is a massive feat, and it's hard to expect even the most data-mature companies to anticipate every mistake. However, if there were a way to employ AI/ML, such solutions could learn from our datasets independently. They could spot errors like this without us having to explicitly say, "if entry is male, then don't provide maternity care."
Actually, there is. It's called anomaly detection.
What is a data quality check?
Before we can get into the marvels of anomaly detection, we have to understand what a data quality check is (and how it works).
A data quality check specifies standards for data's dimensions, i.e., its completeness, validity, timeliness, uniqueness, accuracy, and consistency. Data either fails or satisfies those standards, revealing information about its quality (i.e., whether it is high or low). You can learn more about data quality and its significance here.
Data quality rules will specify what the user defines as high-quality data. For example, a hospital might define a senior patient as over the age of 60. A simple data quality rule could have the following form:
Rule: Senior Patient Age > 60
In reality, each hospital might have different standards for defining a senior patient. Thus, they might define these rules differently. This way, a company can define various rules to identify problematic data. These rules are then added to a "rules library" and used during data quality monitoring to identify low-quality entries.
Once your company populates this rules library, you'll develop a standard or "usual behavior" that you expect your data to adhere to. Data that doesn't stick to those standards is invalid, incomplete, inaccurate, etc.
For example, under our senior patient rule above, if a patient's age is 35 and a user labels them as a "senior patient," this data point would be invalid.
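As a minimal sketch of how such a rules library might work in practice (the function and rule names here are hypothetical, not Ataccama's API), each rule can be a predicate that a record either satisfies or violates:

```python
# Hypothetical sketch of a "rules library": each rule is a named
# predicate; records failing any rule are flagged as low quality.
def senior_patient_rule(record):
    """A record labeled 'senior patient' must have age > 60."""
    if record.get("label") == "senior patient":
        return record.get("age", 0) > 60
    return True  # rule does not apply to other records

RULES = {"senior_patient_age": senior_patient_rule}

def failed_rules(record, rules=RULES):
    """Return the names of all rules the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

# A 35-year-old labeled "senior patient" fails the rule:
print(failed_rules({"age": 35, "label": "senior patient"}))  # ['senior_patient_age']
print(failed_rules({"age": 72, "label": "senior patient"}))  # []
```

Each new rule added to the library extends the definition of "usual behavior" — but every rule still has to be written by hand.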
What is anomaly detection?
However, there's also a way to find data points that differ from usual behavior without needing to write DQ rules. It's called anomaly detection. Instead of DQ rules, it uses ML/AI to scan your data to discover patterns and expected values unique to your datasets. Once it learns how your data system works, it can automatically find data that doesn't fit the norm (or falls outside of these patterns) and flag the entries to alert the relevant parties. The values that don't fit these standards are called "anomalies."
Once you receive an alert about an anomaly, you will find information on why the anomaly detection service labeled that entry anomalous. For example, suppose a hospital recorded having 10,000 patients in February, and the healthcare system receives an alert flagging this entry as anomalous. The alert would explain the flag using context from the dataset: this hospital usually has around 1,000 patients per month, so the sudden jump was unexpected (or display a graph conveying the same information).
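To make the idea concrete, here is a deliberately simple sketch of one way such a flag could be raised: comparing a new value against the mean and standard deviation of its own history. Real services use far more sophisticated models; this is only an illustration of the principle.

```python
from statistics import mean, stdev

def is_anomalous(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard
    deviations from the mean of the historical values."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Monthly patient counts hovering around 1,000...
history = [980, 1020, 995, 1010, 990, 1005]
print(is_anomalous(history, 10_000))  # True: the jump to 10,000 is flagged
print(is_anomalous(history, 1_015))   # False: within the usual range
```

Crucially, nobody wrote a rule saying "patient counts must stay below X" — the expected range was learned from the data itself.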
You can then take that information and decide whether it's a true anomaly or a normal data point. Maybe that hospital had a big jump in patients due to COVID-19. Depending on how you respond, some anomaly detection algorithms can learn from this feedback and become more accurate at detecting anomalies in the future. Anomaly detection is a fundamental part of any automated data quality workflow, such as data observability (see how it looks here).
In our introduction hospital example above, let's say that all people applying for pregnancy-related services were labeled with the tag "PREG." If the vast majority of patients using those services had an "F" (female) in the gender column, anomaly detection would notice right away if a patient who was "M" (male) received the "PREG" tag. You wouldn't need to write the rule "PREG must be F" to prevent this mistake from happening.
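One simple way to catch this kind of anomaly without any rules is to count how often each combination of values occurs and flag the rare ones. The sketch below is a hypothetical illustration, not a production algorithm:

```python
from collections import Counter

def rare_combinations(records, keys, min_share=0.01):
    """Flag records whose combination of values for `keys` occurs in
    less than `min_share` of the data -- no explicit rule required."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    total = len(records)
    return [r for r in records
            if combos[tuple(r[k] for k in keys)] / total < min_share]

# 999 female patients with the PREG tag, and one male:
records = [{"gender": "F", "tag": "PREG"}] * 999 + [{"gender": "M", "tag": "PREG"}]
print(rare_combinations(records, keys=("gender", "tag")))
# [{'gender': 'M', 'tag': 'PREG'}]
```

The ("M", "PREG") pairing is flagged purely because it almost never occurs — the very pattern a rule like "PREG must be F" would have had to encode by hand.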
Different types of anomalies
Various business personas have different ways of defining anomalies in their data.
The marketing team might get an abnormal number of webinar registrations, receive more inbound requests from one company's domain than they usually do, or see too many requests from a single country (more than the norm). These anomalies can affect their work performance and would be labeled as critical.
Data engineers might be more interested in conflicting information about the same entity (like a customer) in two different systems.
A data scientist might see average sales numbers for a random Thursday in February. However, that Thursday was a public holiday when sales are expected to triple. This is for sure a critical anomaly as well!
Therefore, you could say that anomaly definition and, consequently, anomaly detection are rather subjective. The important part to remember is that anomaly detection services must be capable of detecting all forms of abnormalities. At Ataccama, we like to define anomalies based on their proximity to the data. Moving from high-level (far from the actual data, more general info about the data itself) to low-level (anomalies in the columns of data, row by row, specific values/data points), we can define anomalies in three categories: metadata, transactional data, and record data.
Metadata anomalies
Metadata is data that uses metrics to describe the actual underlying data. For example, data quality metadata refers to information about the quality of data resources (databases, data lakes, etc.). Metadata allows you to organize and understand your data in a way that is uniquely meaningful to your use case, keeping it consistent and accurate as well.
Anomalies at this level deal with data "in general" and are the farthest in proximity to the data itself. These are anomalies about the data, not in the data (however, they can still signify a problem in the data). They can occur when there's an unexpected drop in data quality; when a dataset or data point that is usually labeled one way has been tagged in another; when the number of records is too low, too high, or some records are missing; or upon any other unexpected occurrence when extracting data about the data you have stored.
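For instance, a metadata-level check might watch a metric such as the daily record count of a table. A hypothetical sketch using an interquartile-range test (one common robust technique, not necessarily what any given product uses):

```python
from statistics import quantiles

def outside_usual_range(history, new_value, k=1.5):
    """Interquartile-range check: flag a metadata metric (here, a daily
    record count) falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return new_value < q1 - k * iqr or new_value > q3 + k * iqr

# A table that usually receives ~5,000 records per day:
daily_record_counts = [5010, 4985, 5002, 4990, 5020, 5008, 4995]
print(outside_usual_range(daily_record_counts, 120))   # True: records went missing
print(outside_usual_range(daily_record_counts, 5001))  # False: business as usual
```

Note that nothing here inspects individual rows — the check operates purely on a metric about the data, which is what makes it a metadata anomaly.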
Transactional data anomalies
Moving on from metadata and closer to specific data, we arrive at the middle level – transactional data. We call this the middle level because you are dealing with values from the actual data but through an aggregated lens (i.e., every five days or every five minutes). Transactional data usually contains some form of monetary transaction because the ability to analyze such data can be very beneficial. For example, if you had an aggregation of sales for every five minutes, you could use it to determine the busiest hour, whether it's worth staying open past 8 p.m., etc.
Anomalies occurring at this level could be an unexpected increase in sales during one of the slower weeks of the year, a shopping holiday receiving similar totals as a typical day of the week, or one branch's performance dipping unusually low in a busy month, etc.
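The two ingredients at this level are the aggregation itself and a comparison against a learned baseline. A hypothetical sketch (the window size, tolerance, and baseline here are illustrative assumptions):

```python
from collections import defaultdict

def bucket_totals(transactions, window_minutes=5):
    """Sum raw transaction amounts into fixed time windows.
    `transactions` is a list of (minute_of_day, amount) pairs."""
    totals = defaultdict(float)
    for minute, amount in transactions:
        totals[minute // window_minutes * window_minutes] += amount
    return dict(totals)

def unusual_windows(totals, baseline, tolerance=0.5):
    """Flag windows whose total differs from the usual total for that
    window by more than `tolerance` (50%) in either direction."""
    return [w for w, t in totals.items()
            if baseline.get(w) and abs(t - baseline[w]) / baseline[w] > tolerance]

sales = [(0, 12.0), (3, 8.0), (7, 300.0)]   # raw transactions
baseline = {0: 20.0, 5: 25.0}               # usual totals per window
totals = bucket_totals(sales)               # {0: 20.0, 5: 300.0}
print(unusual_windows(totals, baseline))    # [5]: an unexpected spike
```

Individual transactions are never judged on their own — only their aggregated totals are, which is exactly what distinguishes this level from record-level detection.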
Record data anomalies
At the record level, anomaly detection flags specific values in the dataset that are suspicious. These values could be labeled as anomalous if a data point is missing, incomplete, inconsistent, or incorrect.
Our intro is an excellent example of a record-level anomaly. One of the values in the dataset (the gender) was unexpected and didn't align with the other values from the system. This is just one row of information, a part of a bigger set that could contain the patient's age, previous conditions, height, weight, etc.
Anomaly detection at the record level explores the dataset row by row in each table and column, looking for any inconsistencies. It can reveal issues in data collection, aggregation, or handling.
Types of anomaly detection
Now that we understand the different types of anomalies, we can get into the different approaches to detecting them. One approach focuses on time as the primary context of the data, while the other focuses on finding anomalies in the context of normal behavior. These two types of anomaly detection are called time-dependent and time-independent.
Time-dependent anomaly detection
Time-dependent data evolves over time (think about our example of transactional data), so it’s important to know when a value is captured, when it was inputted, in what order multiple entries arrived, etc. Usually, users will group (aggregate) such data together (for example, hourly or daily) and look for anomalies or trends on a group level, spotting outliers contextually.
For example, when you have daily data (i.e., recorded once per day), you can expect some seasonality. In other words, Monday's expected value might differ from Tuesday's. Therefore, different values might be anomalous on different days. Furthermore, such data often changes over longer periods. This can be expressed by a trend in the data or by drift. All of these patterns need to be captured by the time-dependent anomaly detection algorithm.
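A minimal way to capture weekday seasonality is to learn a separate baseline per weekday, so the same value can be normal on Monday and anomalous on Tuesday. A hypothetical sketch (real time-series models also handle trend and drift, which this toy version ignores):

```python
from statistics import mean, stdev
from collections import defaultdict

def weekday_baselines(series):
    """Learn a separate (mean, stdev) per weekday from a history of
    (weekday, value) pairs -- Mondays get their own expected value."""
    by_day = defaultdict(list)
    for weekday, value in series:
        by_day[weekday].append(value)
    return {d: (mean(v), stdev(v)) for d, v in by_day.items()}

def seasonal_anomaly(baselines, weekday, value, threshold=3.0):
    """Compare a value only against its own weekday's baseline."""
    mu, sigma = baselines[weekday]
    return sigma > 0 and abs(value - mu) / sigma > threshold

history = [("Mon", 100), ("Mon", 104), ("Mon", 98),
           ("Tue", 200), ("Tue", 195), ("Tue", 205)]
baselines = weekday_baselines(history)
print(seasonal_anomaly(baselines, "Tue", 100))  # True: normal for a Monday, not a Tuesday
print(seasonal_anomaly(baselines, "Mon", 100))  # False
```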
Here we see how an anomaly detection service would spot and flag a time-dependent anomaly. This time series contains seasonal, trend, and residual components.
Time-independent anomaly detection
Any data without the time dimension can be considered "time-independent." In other words, when the data was created, when it entered the system, what order it arrived in, etc., are unimportant. Only the actual values matter. Therefore, the algorithm only needs to learn what the expected values are or, better yet, group them into "clusters of normality."
These anomalies are more relevant for master data (as opposed to transactional data): customer records, product data, reference data, and other "static data."
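To illustrate the "clusters of normality" idea, here is a toy 1-D k-means sketch: learn where the normal values cluster, then flag any value far from every cluster center. This is a simplified illustration under assumed parameters, not any product's actual algorithm:

```python
from statistics import mean

def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means: learn k cluster centers ("clusters of normality").
    Assumes len(values) >= k; uses evenly spaced initial centers."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:  # assign each value to its nearest center
            groups[min(range(len(centers)), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def is_outlier(value, centers, max_distance):
    """A value far from every learned center fits no cluster of normality."""
    return min(abs(value - c) for c in centers) > max_distance

data = [10, 11, 9, 10, 50, 51, 49, 52]  # two clusters of normal values
centers = kmeans_1d(data, k=2)
print(is_outlier(500, centers, max_distance=15))  # True: near no cluster
print(is_outlier(12, centers, max_distance=15))   # False
```

Notice that the order of the values never matters — only where they sit relative to the learned clusters, which is what makes this approach time-independent.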
Here we see how an anomaly detection service would spot and flag a time-independent anomaly. It identifies "clusters of normality" (in blue) among multiple datasets.
Essential features of anomaly detection
When investing in an anomaly detection service (or a platform with anomaly detection capabilities), there are four requirements to keep in mind: extensive, self-driving, holistic, and explainable. In the following figure, you can see these characteristics with a detailed explanation below.
Users often need a versatile tool that covers a wide variety of data. Anomaly detection algorithms should work reasonably well on very different data: few data points, many data points, near-constant data, high-variance data, many anomalies, rare anomalies, and data containing changepoints (drifts).
The anomaly detection tool should also be usable by many people with different roles in their organizations (and varying levels of technicality). Users employ anomaly detection services to reduce their workload – not to add an extra burden! Therefore, the configurability of the algorithm should be minimal so that users don't get too distracted or discouraged before they even start analyzing the data.
Users usually need anomaly detection to cover the majority, if not all, of their data pipeline. They might have trouble with anomalies in transactional data, metadata, or record data – or all three. Additionally, their needs may change over time, so they need a complete solution that scales with their changing demands. From high level to low, the tool should recognize anomalies as close to or as far from the data as you wish.
People love AI because it does part of their work instead of them. At the same time, people are often uncomfortable giving up control and letting AI make decisions for them. That's why they want suggestions that AI makes to be explainable so that they understand how AI got there.
This is relevant not only to DQ and anomaly detection but also to ethical AI. For example, in banking, explainability can help prevent discrimination against certain groups of loan applicants. With transparency into AI's decision-making, algorithms can be updated to be more "ethical."
In summary, anomaly detection algorithms allow you to spot unwanted or unexpected values in your data without having to specify rules and standards. They take a snapshot of your datasets and identify anomalies by comparing new data to patterns found in the past in the same or similar datasets.
As far as expectations for what an anomaly detection tool can do:
- Whether these anomalies occur at a high level (like with metadata) or close to the actual data itself (like with record-level anomalies), your anomaly detection service needs to be able to spot them.
- Both time-dependent and time-independent anomaly detection are necessary for it to apply to all types of data.
- Your service also must be able to handle diverse data types, be easily usable and adaptable, and have explanations available when it flags values as anomalous.
The field of anomaly detection continues to grow and evolve. AI/ML is becoming more widely adopted and implemented across the data management space. We can expect anomaly detection to become increasingly proactive, not reactive. These tools will be able to spot problematic data before it enters downstream systems where it can cause harm.
Anomaly detection is valuable because it usually reveals an underlying problem beyond the data, such as defective machines in IoT devices, hacking attempts in networks, infrastructure failures in data merging, or inaccurate medical checks. These issues are often difficult to anticipate and, consequently, write DQ rules for. Therefore, AI/ML-based anomaly detection is the best method for spotting them.
Now that you know how valuable anomaly detection is, you're probably eager to implement it for your data management initiative. Ataccama ONE has anomaly detection built into our data catalog so you can organize your data and spot anomalies all in the same place. Learn more on the platform page!