Data Preparation

If your data isn’t ready to be accurately analyzed, it probably needs to undergo data preparation. Data preparation manipulates raw data to fit it into the format a company needs and can actually use. It can include processes like ETL (extract, transform, load), enrichment, cleansing, data fusion, augmentation, and delivery. It is usually the first step in any analytics project.

For example:

Suppose a data set has differently formatted telephone numbers (e.g., 123.456.789 vs. 123456789), but your system only accepts data in one format. In that case, you could use data preparation to change those entries into the correct format. 
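
The telephone number case above can be sketched in a few lines of Python. This is a minimal illustration, not a complete solution: the function name normalize_phone and the sample entries are hypothetical, and it assumes the target format is simply "digits only."

```python
import re


def normalize_phone(raw: str) -> str:
    """Remove every non-digit character so all entries share one format."""
    return re.sub(r"\D", "", raw)


# Hypothetical sample entries in mixed formats.
entries = ["123.456.789", "123456789", "123-456-789", "(123) 456 789"]
normalized = [normalize_phone(entry) for entry in entries]
print(normalized)
```

Real phone data usually needs more than this (country codes, extensions, validation of length), so treat the sketch as a starting point rather than a finished rule.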

Data preparation involves several tasks, and the specifics depend on the dataset you’re working with and the desired result. However, it typically follows a few steps: access the data, load it, cleanse it, format it, combine it with other data, and analyze it. As data management systems evolve, many of these processes have become automated. It’s now possible to prepare data automatically before it reaches its final destination by implementing data governance and data quality (DQ) rules across the data pipeline.
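
As a rough sketch of those steps, the following Python example walks two small in-memory CSV sources through load, cleanse, format, combine, and analyze. Everything here is an assumption made for illustration: the sample data, the column names, the function names, and the cleansing rules (trim whitespace, drop rows without a name, drop exact duplicates). It uses only the standard library.

```python
import csv
import io

# Hypothetical raw inputs standing in for two source systems.
CUSTOMERS_CSV = """id,name,phone
1, Alice ,123.456.789
2,Bob,987654321
2,Bob,987654321
3,,555-000-111
"""

ORDERS_CSV = """customer_id,amount
1,20.50
2,5.00
1,4.50
"""


def load(text):
    """Access and load: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleanse: trim whitespace, drop rows with no name, drop exact duplicates."""
    seen, clean = set(), []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        fingerprint = tuple(row.items())
        if not row["name"] or fingerprint in seen:
            continue
        seen.add(fingerprint)
        clean.append(row)
    return clean


def format_rows(rows):
    """Format: keep only the digits of each phone number."""
    return [
        {**row, "phone": "".join(ch for ch in row["phone"] if ch.isdigit())}
        for row in rows
    ]


def combine(customers, orders):
    """Combine: attach each customer's order total to the customer row."""
    totals = {}
    for order in orders:
        customer_id = order["customer_id"]
        totals[customer_id] = totals.get(customer_id, 0.0) + float(order["amount"])
    return [{**c, "total": totals.get(c["id"], 0.0)} for c in customers]


customers = format_rows(cleanse(load(CUSTOMERS_CSV)))
prepared = combine(customers, load(ORDERS_CSV))

# Analyze: a trivial aggregate over the prepared data.
grand_total = sum(row["total"] for row in prepared)
print(prepared, grand_total)
```

Keeping each step in its own small function makes it easier to automate the pipeline later, for example by running the cleansing rules on every incoming batch.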

By a commonly cited estimate, data scientists spend about 80% of their time preparing data. Advanced knowledge of data preparation, along with established protocols for it, can cut this time in half and allow your scientists to focus on what’s important: actually using the data.

Having fully available and prepared datasets can also expand the number of users who can work with data. Nontechnical or business users can work with data from all of your company's sources without needing to bother a scientist or the IT department to prepare it for them.