What is Data Preparation in Machine Learning?
Data preparation (also referred to as “data pre-processing”) is the process of transforming raw data so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions.
Steps in Data Preparation
Reducing the time necessary for data preparation has become increasingly important, as it leaves more time to test, tune, and optimize models to create greater value. To prepare data for both analytics and machine learning initiatives, teams can accelerate and automate the data-to-insight pipeline by following six critical steps:
1. Data Collection
As a society, we’re generating data at an unprecedented rate. These data can be numeric (temperature, loan amount, customer retention rate), categorical (gender, color, highest degree earned), or even free text (doctor’s notes or opinion surveys).
Data collection is the process of gathering and measuring information from countless different sources. In order to use the data we collect to develop practical artificial intelligence and machine learning solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
2. Data Exploration and Profiling
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.
These characteristics can include the size or amount of data, completeness of the data, the correctness of the data, possible relationships amongst data elements or files/tables in the data.
Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics.
This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data.
All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis.
Once this initial understanding is in place, the data can be pruned or refined by removing unusable parts (data cleansing), correcting poorly formatted elements, and defining relevant relationships across datasets. This process is closely tied to assessing data quality.
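The profiling activities above can be sketched with pandas. This is a minimal illustration using a hypothetical customer dataset (the column names and values are invented for the example):

```python
import pandas as pd

# Hypothetical customer dataset used only for illustration.
df = pd.DataFrame({
    "age": [34, 45, None, 29, 61],
    "state": ["TX", "Texas", "CA", "CA", None],
    "balance": [120.5, 87.0, 430.2, 55.9, 9000.0],
})

# Size and completeness: row/column counts and missing values per column.
print(df.shape)
print(df.isna().sum())

# Basic statistics (count, mean, std, quartiles) for numeric columns.
print(df.describe())

# Cardinality of categorical columns hints at standardization issues:
# "TX" and "Texas" show up here as two distinct values.
print(df["state"].nunique(dropna=True))
```

Even this quick pass surfaces the metadata an analyst needs: the dataset's size, where values are missing, and where inconsistent encodings lurk.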
3. Data Formatting
The next step in data preparation is to ensure your data is formatted in a way that best fits your machine learning model. If you are aggregating data from different sources, or if your data set has been manually updated by more than one person, you’ll likely discover anomalies in how the data is formatted (e.g. USD5.50 versus $5.50).
In the same way, standardizing values in a column (e.g., state names that may be spelled out or abbreviated) will ensure that your data aggregates correctly. Consistent data formatting removes these discrepancies so that the entire dataset follows the same input conventions.
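The USD5.50-versus-$5.50 example above can be handled with a small normalization function. This is a sketch, assuming the only two prefixes in the data are "USD" and "$":

```python
import re
import pandas as pd

# Hypothetical raw column mixing "USD5.50" and "$5.50" style amounts.
raw = pd.Series(["USD5.50", "$5.50", "USD12.00", "$3.25"])

def to_amount(value: str) -> float:
    # Strip a leading currency marker ("USD" or "$") and parse the number.
    return float(re.sub(r"^(USD|\$)", "", value.strip()))

amounts = raw.map(to_amount)
print(amounts.tolist())  # [5.5, 5.5, 12.0, 3.25]
```

Once every value is a plain float, aggregations like sums and averages behave correctly regardless of which system the record came from.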
4. Improving Data Quality
This step starts with a strategy for dealing with erroneous data, missing values, extreme values, and outliers. Self-service data preparation tools can help if they have intelligent facilities built in to match data attributes from disparate datasets and combine them intelligently. For instance, if you have columns for FIRST NAME and LAST NAME in one dataset and another dataset has a column called CUSTOMER that seems to hold a FIRST and LAST NAME combined, intelligent algorithms should be able to determine a way to match these and join the datasets to get a singular view of the customer.
For continuous variables, use histograms to review the distribution of your data and reduce skewness. Examine records that fall outside an accepted range of values: such an “outlier” could be an input error, or it could be a real and meaningful result that informs future predictions. Separately, duplicate or near-duplicate records carry the same information and should be eliminated. And take care before automatically deleting all records with a missing value, as too many deletions could skew your dataset so that it no longer reflects real-world situations.
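One common way to operationalize this advice is the interquartile-range (IQR) rule: flag, rather than silently delete, values far outside the quartiles, and impute missing values instead of dropping rows. A minimal sketch with a hypothetical loan-amount column:

```python
import pandas as pd

# Hypothetical loan amounts; 99000.0 is a candidate outlier, None is missing.
s = pd.Series([1200.0, 1500.0, 1100.0, 1350.0, 99000.0, None])

# Flag values outside 1.5 * IQR of the quartiles rather than deleting blindly.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [99000.0]

# Impute the missing value with the median of the non-outlier records
# instead of dropping the whole row.
median = s[~s.index.isin(outliers.index)].median()
filled = s.fillna(median)
```

The flagged value still needs human judgment: it may be a keying error, or a genuinely large loan that the model should learn from.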
5. Feature Engineering
Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. A feature is a property shared by independent units on which analysis or prediction is to be done. Features are used by predictive models and influence results.
Feature selection plays a vital role in building a machine learning model, directly affecting its performance and accuracy. It is the process of choosing, automatically or manually, the features that contribute most to the prediction or output of interest. Training on irrelevant features can cause the model to overfit or underfit.
The benefits of feature selection are:
- Reduced overfitting/underfitting
- Improved accuracy
- Reduced training/testing time
- Improved performance
</gr-replace>
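A simple filter-style feature selection can be sketched with numpy alone: rank features by their absolute correlation with the target and keep the strong ones. The dataset and the 0.5 cutoff below are invented for illustration:

```python
import numpy as np

# Tiny synthetic dataset: y depends on x1; x2 is pure noise; x3 duplicates x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.01, size=100)
y = 3 * x1 + rng.normal(scale=0.1, size=100)

X = np.column_stack([x1, x2, x3])

# Rank features by absolute Pearson correlation with the target; this is a
# filter-style selection that drops irrelevant columns before modeling.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
keep = [j for j, s in enumerate(scores) if s > 0.5]
print(keep)  # x1 and x3 are kept; the noise feature x2 is dropped
```

Note that correlation-based filtering keeps both x1 and its near-duplicate x3; removing redundant features on top of irrelevant ones requires checking correlations among the features themselves as well.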
6. Splitting Data into Training and Evaluation Sets
The final step is to split your data into two sets: one for training your algorithm, and another for evaluation purposes. Be sure to select non-overlapping subsets of your data for the training and evaluation sets in order to ensure proper testing. Invest in tools that provide versioning and cataloging of your original source as well as your prepared data for input to machine learning algorithms, and the lineage between them. This way, you can trace the outcome of your predictions back to the input data to refine and optimize your models over time.
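A non-overlapping split can be done by shuffling the row indices with a fixed seed (the 80/20 ratio and ten-record dataset below are arbitrary choices for the sketch):

```python
import numpy as np

# Hypothetical dataset of 10 records; split 80/20 into non-overlapping
# training and evaluation sets, with a fixed seed for reproducibility.
rng = np.random.default_rng(42)
indices = rng.permutation(10)
train_idx, eval_idx = indices[:8], indices[8:]

# No record appears in both sets.
assert set(train_idx).isdisjoint(set(eval_idx))
print(sorted(train_idx), sorted(eval_idx))
```

Fixing the seed means the same split can be reproduced later, which is exactly the kind of lineage the paragraph above recommends tracking.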
Factors Affecting the Quality of Data in Data Preparation
The data preparation process can be complicated by issues such as:
1. Missing or Incomplete Records
It is difficult to get every data point for every record in a dataset. Missing data sometimes appears as empty cells, values (e.g., NULL or N/A), or a particular character, such as a question mark. For example:
| Column A | Column B |
| --- | --- |
| 50 – 60 | ? |
| 20 – 30 | 50 – 75 |
| 80 – 90 | NULL |
| 50 – 60 | N/A |
| 50 – 60 | ? |
| 70 – 80 | 60 – 70 |
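Sentinel values like these can be normalized to a single missing-value marker before analysis. A minimal sketch, assuming "?", "NULL", "N/A", and empty strings are the only sentinels in the column:

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing data hides behind several sentinels.
raw = pd.Series(["50 – 60", "?", "NULL", "N/A", "70 – 80", ""])

# Treat "?", "NULL", "N/A", and empty strings uniformly as missing (NaN).
sentinels = {"?", "NULL", "N/A", ""}
clean = raw.map(lambda v: np.nan if v.strip() in sentinels else v)
print(int(clean.isna().sum()))  # 4
```

With every sentinel mapped to NaN, pandas functions like `isna`, `dropna`, and `fillna` all see the same picture of where data is actually missing.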
2. Outliers or Anomalies
Unexpected values often surface in a distribution of values, especially when working with data from unknown sources that lack proper data validation controls.
3. Improperly Formatted / Structured Data
Data sometimes needs to be extracted into a different format or location. A good way to address this is to consult domain experts or join data from other sources.
4. Inconsistent Values and Non-Standardized Categorical Variables
Often when combining data from multiple sources, we can end up with variations in variables like company names or states. For instance, a state in one system could be “Texas,” while in another it could be “TX.” Finding all variations and standardizing them correctly will greatly improve model accuracy.
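The Texas/TX case can be standardized with a lookup table. The mapping below is a hypothetical fragment for illustration; a real pipeline would carry all fifty states:

```python
import pandas as pd

# Hypothetical mapping from spelled-out state names to postal abbreviations.
STATE_ABBREV = {"Texas": "TX", "California": "CA", "New York": "NY"}

states = pd.Series(["Texas", "TX", "california", "NY"])

def standardize(value: str) -> str:
    # Normalize case, then map full names to abbreviations; values that
    # are already two-letter abbreviations pass through in upper case.
    v = value.strip().title()
    return STATE_ABBREV.get(v, v.upper() if len(v) == 2 else v)

print(states.map(standardize).tolist())  # ['TX', 'TX', 'CA', 'NY']
```

Case normalization before the lookup is what catches entries like "california" that differ only in capitalization.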
5. Limited or Sparse Features / Attributes
Feature enrichment, or building out the features in our data, often requires us to combine datasets from diverse sources. Joining files from different systems is often hampered when there are no easy or exact columns to match the datasets on. This requires the ability to perform fuzzy matching, which may also involve combining multiple columns to achieve the match. For instance, combining two datasets on CUSTOMER ID (present in both datasets) could be easy. Combining a dataset that has separate columns for CUSTOMER FIRST NAME and CUSTOMER LAST NAME with another dataset whose CUSTOMER FULL NAME column contains “Last name, First name” is trickier.
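The first/last-name-versus-full-name join can be sketched by building a comparable key on both sides and fuzzy-matching with the standard library's difflib. The datasets, column names, and 0.8 similarity cutoff below are invented for the example:

```python
import difflib
import pandas as pd

# Hypothetical datasets with no shared key: one stores first/last name
# separately, the other stores "Last, First" in a single column.
left = pd.DataFrame({"first": ["Ada", "Grace"], "last": ["Lovelace", "Hopper"]})
right = pd.DataFrame({"full_name": ["Lovelace, Ada", "Hopper, Grace"],
                      "balance": [100.0, 250.0]})

# Build a comparable "Last, First" key on both sides.
left["key"] = (left["last"] + ", " + left["first"]).str.lower()
right["key"] = right["full_name"].str.lower()

def best_match(key, candidates):
    # Return the closest candidate above a 0.8 similarity cutoff, if any;
    # the cutoff tolerates small typos without accepting wrong matches.
    matches = difflib.get_close_matches(key, candidates, n=1, cutoff=0.8)
    return matches[0] if matches else None

left["match"] = left["key"].map(lambda k: best_match(k, right["key"].tolist()))
joined = left.merge(right, left_on="match", right_on="key", suffixes=("", "_r"))
print(joined[["first", "last", "balance"]])
```

Production-grade record linkage usually adds blocking, multiple comparison fields, and manual review of borderline matches, but the shape of the problem, derive a common key and match approximately, is the same.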