7 Basic Steps of Machine Learning

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct experience, or instruction. The system looks for patterns in this data so that it can make better decisions in the future, based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

Let’s understand the 7 steps of machine learning with examples.

7 Steps of Machine Learning

The process of machine learning can be broken into 7 steps. These steps are:

  • Data Collection
  • Data Preparation
  • Choosing a Model
  • Training a Model
  • Evaluating a Model
  • Hyperparameter Tuning 
  • Prediction

To understand these steps, let’s consider a model being trained to classify a fruit as either an apple or an orange. In the real world, machine learning is capable of much more complex tasks, but this simple example is enough to explain the process.

1. Data Collection

Data collection is the process of gathering and measuring information from different sources. To use the data we collect to develop practical artificial intelligence (AI) and machine learning solutions, it must be collected and stored in a way that makes sense for the business problem at hand.

Many different parameters could be used to classify a fruit as an apple or an orange. To keep things simple, let us take only 2 features for our model to use: colour and shape. Using these features, we would hope that our model can accurately differentiate between these 2 fruits.

Thus, the data collected in this example can be summarized as:

Colour    Shape            Fruit
Red       Round Conical    Apple
Orange    Round            Orange

A mechanism would be needed to gather the data for our 2 chosen features. For instance, to collect data on colour we may use a spectrometer, and for the shape we may use pictures of the fruit. We would try to get as many different types of apples and oranges as possible to create a data set for our features.
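As a rough illustration of storing collected samples in a consistent format, the sketch below writes a few records to CSV text and reads them back using Python's standard `csv` module. The sample records, the field names, and the helper names `to_csv` and `from_csv` are assumptions made for this example, not real measurements.

```python
import csv
import io

# Assumed column names for the two features and the label.
FIELDS = ["colour", "shape", "fruit"]

# Hypothetical hand-entered samples; a real project would record
# spectrometer readings and shapes extracted from pictures instead.
samples = [
    {"colour": "Red", "shape": "Round Conical", "fruit": "Apple"},
    {"colour": "Red", "shape": "Round Conical", "fruit": "Apple"},
    {"colour": "Orange", "shape": "Round", "fruit": "Orange"},
]


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of plain dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


csv_text = to_csv(samples)
restored = from_csv(csv_text)
print(csv_text)
```

Keeping one fixed set of columns from the start makes it easier to merge samples gathered at different times or by different people.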

Collecting data allows us to capture a record of past events so that we can use data analysis to find recurring patterns. From these patterns, we build predictive models using machine learning algorithms that look for trends and predict future changes.

Predictive models are only as good as the data from which they are built, so good data collection practices are crucial to developing high-performing models. The data needs to be error-free and contain information relevant to the task at hand. For example, the model in our example would not benefit from the size of the fruit, as an apple and an orange can be the same size.

2. Data Preparation

Data preparation may be one of the most difficult steps in any machine learning project. The reason is that each dataset is different and highly specific to the project. 

Data preparation (also referred to as “data preprocessing”) is the process of transforming raw data so that it can run through machine learning algorithms to make predictions.

For our example, once we have gathered the data for the two key features, our next step would be to prepare the data for the following steps. A key focus of this stage is to recognize and minimize any potential biases in our data sets for the 2 features, because we do not want artefacts of how the data was gathered to influence the model’s choices. Furthermore, we would examine our data sets for any skew towards a particular fruit. If, say, most of the samples are apples, the model would be adept at identifying apples but may struggle with oranges, so identifying and rectifying such an imbalance removes another potential bias.
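One simple way to check for skew towards a particular fruit is to compute the share of each label in the data set. The sketch below does this with the standard library. The label list, the helper names, and the `max_gap` threshold of 0.2 are illustrative assumptions, not recommended values.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belongs to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_skewed(shares, max_gap=0.2):
    """Flag the data set when the largest and smallest shares differ
    by more than max_gap (an arbitrary threshold for this sketch)."""
    return max(shares.values()) - min(shares.values()) > max_gap


# Hypothetical, deliberately unbalanced labels: 8 apples and 2 oranges.
labels = ["Apple"] * 8 + ["Orange"] * 2
shares = class_shares(labels)
print(shares, "skewed:", is_skewed(shares))
```

If the check flags a skew, typical remedies are collecting more samples of the under-represented fruit or sampling fewer of the over-represented one.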

The data preparation process can be complicated by issues such as:

  • Missing or incomplete records: It is difficult to get every data point for every record in a dataset. Missing data sometimes appears as empty cells, placeholder values, or a particular character.
  • Outliers or anomalies: Unexpected values often surface in a distribution of values, especially when working with data from unknown sources that lack proper data validation controls.
  • Improperly formatted/structured data: Data sometimes needs to be extracted into a different format or location. A good way to address this is to consult domain experts or join data from other sources.
  • Inconsistent values and non-standardized categorical variables: When combining data from multiple sources, we often end up with variations in how a value is recorded. In our example, “RD” might be recorded in place of “Round”.
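Two of the issues listed above, missing records and inconsistent category names, can be handled with a small cleaning pass. The following is a minimal sketch, assuming that missing values appear as empty strings, `None`, "?" or "n/a", and that "RD" is a known variant of "Round". The alias table and the `clean` helper are hypothetical.

```python
# Assumed markers for missing values and known spelling variants.
MISSING_MARKERS = {"", "?", "n/a"}
SHAPE_ALIASES = {"rd": "Round", "round": "Round"}


def clean(records):
    """Drop records with missing fields and standardize shape names."""
    cleaned = []
    for record in records:
        # Treat None as empty and trim stray whitespace.
        values = {key: (value or "").strip() for key, value in record.items()}
        if any(value.lower() in MISSING_MARKERS for value in values.values()):
            continue  # incomplete record
        shape = SHAPE_ALIASES.get(values["shape"].lower())
        if shape is None:
            continue  # unknown shape: treated as an anomaly in this sketch
        values["shape"] = shape
        cleaned.append(values)
    return cleaned


raw = [
    {"colour": "Red", "shape": "RD", "fruit": "Apple"},
    {"colour": "Orange", "shape": "", "fruit": "Orange"},
    {"colour": "Orange", "shape": " round ", "fruit": "Orange"},
    {"colour": "Red", "shape": None, "fruit": "Apple"},
]
cleaned = clean(raw)
print(cleaned)
```

Dropping incomplete records is only one option. Depending on the project, it may be better to fill in missing values or to ask a domain expert how they should be treated.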

3. Choosing a Model

The selection of the model type is our next course of action once we are done with the data-centric steps. Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset. Model selection can be applied both across different types of models (e.g., logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g., different kernels in an SVM).

Different models are designed with different goals in mind. For instance, some models are better suited to dealing with text, while others are better equipped to handle images. For our example, a simple linear model is suitable for differentiating between the fruit. Because the output is a category rather than a number, a linear classifier such as logistic regression fits better than plain linear regression. In this case, the type of fruit would be our dependent variable, while the colour and shape of the fruit would be the 2 predictors, or independent variables.
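The idea of picking one model from several candidates can be sketched as scoring each candidate on held-out validation data and keeping the best one. The two candidates below are deliberately trivial stand-ins (a majority-class baseline and a per-colour lookup), and the data is hypothetical. Real candidates would be models such as logistic regression, SVM, or KNN.

```python
from collections import Counter, defaultdict


def fit_majority(train):
    """Baseline: always predict the most common fruit in the training data."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: top


def fit_colour_lookup(train):
    """Predict the fruit most often seen with each colour."""
    by_colour = defaultdict(Counter)
    for (colour, _shape), label in train:
        by_colour[colour][label] += 1
    fallback = fit_majority(train)

    def predict(features):
        colour = features[0]
        if colour in by_colour:
            return by_colour[colour].most_common(1)[0][0]
        return fallback(features)  # colour never seen during fitting

    return predict


def accuracy(model, data):
    """Fraction of (features, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical (colour, shape) -> fruit samples.
train = [
    (("Red", "Round Conical"), "Apple"),
    (("Red", "Round Conical"), "Apple"),
    (("Red", "Round"), "Apple"),
    (("Orange", "Round"), "Orange"),
    (("Orange", "Round"), "Orange"),
]
validation = [(("Red", "Round"), "Apple"), (("Orange", "Round"), "Orange")]

candidates = {"majority": fit_majority, "colour_lookup": fit_colour_lookup}
scores = {name: accuracy(fit(train), validation) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "selected:", best)
```

The same loop works for candidates of the same type with different settings, since each entry only needs a function that fits on the training data and returns a predictor.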

4. Training a Model

Training a model is a major part of the machine-learning process. The bulk of the “learning” is done at this stage. Here we use the part of the data set allocated for training to teach our model to perform prediction.

In our example, part of the data set collected in the earlier stages is allocated for training, to teach our model to differentiate between the 2 fruits. If we view our model in mathematical terms, the inputs, i.e., our 2 features, have coefficients. These coefficients are called the weights of the features. Their values are determined by a process of trial and error. Initially, we pick random values for them and provide inputs. The output the model produces is compared with the actual output, and the difference is reduced by adjusting the weights. These iterations are repeated using different entries from our training data set until the model reaches the desired level of accuracy.
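The loop described above (start from random weights, compare the output with the actual label, adjust, repeat) can be sketched with logistic regression trained by gradient descent. This is one common way to do the adjustment, chosen here for illustration. The numeric encoding of the features, the learning rate, and the number of epochs are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, learning_rate=0.5, epochs=200, seed=0):
    """Fit one weight per feature plus a bias with per-sample updates."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in rows[0]]  # random start
    bias = 0.0
    for _ in range(epochs):  # each epoch is one sweep over the data
        for x, y in zip(rows, labels):
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = predicted - y  # difference from the actual output
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (apple) or 0 (orange) for one encoded sample."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical encoding: [is_red, is_round]; label 1 = apple, 0 = orange.
rows = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 1]]
labels = [1, 1, 1, 0, 0]

weights, bias = train(rows, labels)
print("weights:", weights, "bias:", bias)
print("predictions:", [predict(weights, bias, x) for x in rows])
```

Libraries such as scikit-learn wrap this kind of loop behind a single fit call, so in practice it is rarely written by hand.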

5. Evaluating a Model

With the model trained, it needs to be tested to see if it would operate well in real-world situations. That is why the part of the data set created for evaluation is used to check the model’s proficiency. This puts the model in a scenario where it may encounter situations that were not a part of its training. 

In our case, it would mean trying to identify a type of apple or an orange that is completely new to the model. However, through its training, the model should be capable enough to predict whether the fruit is an apple or an orange.
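A minimal sketch of this hold-out evaluation follows: the data is split once, and the model is scored only on the part it has not seen. The records are hypothetical, the 25% test fraction is an arbitrary choice, and `colour_rule` is a stand-in for whatever model the training step produced.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and set aside a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, records):
    """Fraction of held-out records the model labels correctly."""
    correct = sum(predict(features) == label for features, label in records)
    return correct / len(records)


def colour_rule(features):
    """Stand-in for a trained model: red means apple, otherwise orange."""
    return "Apple" if features["colour"] == "Red" else "Orange"


# Hypothetical labelled records.
records = [({"colour": "Red", "shape": "Round"}, "Apple") for _ in range(4)]
records += [({"colour": "Orange", "shape": "Round"}, "Orange") for _ in range(4)]

train_set, test_set = train_test_split(records)
print("held-out accuracy:", accuracy(colour_rule, test_set))
```

Accuracy is only one possible measure. When one fruit is much rarer than the other, per-class measures such as precision and recall usually say more about how the model behaves.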

Evaluation becomes highly important when it comes to commercial applications. Evaluation allows data scientists to check whether the goals they set out to achieve were met or not. If the results are not satisfactory then the prior steps need to be revisited so that the root cause behind the model’s underperformance can be identified and, subsequently, rectified.

6. Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process begins. Hyperparameters are different from parameters, which are the internal coefficients or weights for a model found by the learning algorithm. Unlike parameters, hyperparameters are specified by the practitioner when configuring the model.

If the evaluation is successful, we proceed to the step of hyperparameter tuning. This step tries to improve upon the positive results achieved during the evaluation step. For our example, we would see if we can make our model even better at recognizing apples and oranges. There are different ways we can go about improving the model.

Typically, it is challenging to know what values to use for the hyperparameters of a given algorithm on a given dataset. It is therefore common to use random search or grid search strategies to try different hyperparameter values.

The more hyperparameters of an algorithm that you need to tune, the slower the tuning process. Therefore, it is desirable to select a minimum subset of model hyperparameters to search or tune. One way to improve the model is to revisit the training step and use multiple sweeps of the training data for training the model.
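A grid search can be sketched as trying every combination of candidate values and keeping the one that scores best on validation data. The example below tunes two hyperparameters of a small logistic regression trainer: the learning rate and the number of sweeps (epochs) over the training data. The grid values, the encoded data, and the helper names are assumptions made for illustration.

```python
import itertools
import math


def train(rows, labels, learning_rate, epochs):
    """Logistic regression with per-sample gradient updates."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def accuracy(model, rows, labels):
    """Fraction of samples on the correct side of the decision boundary."""
    weights, bias = model
    hits = 0
    for x, y in zip(rows, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += (z >= 0) == (y == 1)
    return hits / len(rows)


def grid_search(grid, train_set, validation_set):
    """Try every combination in the grid; keep the best validation score."""
    best_params, best_score = None, -1.0
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(train(*train_set, **params), *validation_set)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical encoding: [is_red, is_round]; label 1 = apple, 0 = orange.
train_set = ([[1, 1], [1, 0], [1, 1], [0, 1], [0, 1]], [1, 1, 1, 0, 0])
validation_set = ([[1, 1], [0, 1]], [1, 0])
grid = {"learning_rate": [0.01, 0.1, 1.0], "epochs": [1, 10, 100]}

best_params, best_score = grid_search(grid, train_set, validation_set)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added hyperparameter (here 3 × 3 = 9 training runs), which is why a small grid is preferable.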

7. Prediction

Prediction refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data to forecast the likelihood of a particular outcome. The algorithm generates a value for an unknown variable for each record in the new data, allowing the model builder to identify what that value will most likely be.

Our fruit model should now be able to answer the question of whether the given fruit is an apple or an orange. 
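To make this concrete, the sketch below applies a linear model to new, unlabelled fruit and reports both the predicted label and the estimated likelihood. The weights and bias are made-up values standing in for the result of the training step, and the `[is_red, is_round]` encoding is likewise an assumption.

```python
import math

# Hypothetical parameters standing in for the output of training.
WEIGHTS = {"is_red": 4.0, "is_round": -1.0}
BIAS = -1.5


def apple_probability(record):
    """Estimated likelihood that the record describes an apple."""
    z = BIAS + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def predict(record, threshold=0.5):
    """Turn the likelihood into a label using a cut-off."""
    return "Apple" if apple_probability(record) >= threshold else "Orange"


# New fruit the model has not seen, encoded the same way as the training data.
new_fruit = [
    {"is_red": 1, "is_round": 1},
    {"is_red": 0, "is_round": 1},
]
for record in new_fruit:
    print(record, predict(record), round(apple_probability(record), 3))
```

Keeping the likelihood alongside the label is useful when a wrong answer is costly, because borderline cases can then be passed to a person for review.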

The word “prediction” can be misleading. In some cases, it does mean that you are predicting a future outcome, such as when you’re using machine learning to determine the next best action in a marketing campaign. Other times, the “prediction” concerns something that has already happened, for example, whether or not a past transaction was fraudulent. In that case, you’re making an educated guess about whether or not the transaction was legitimate, allowing you to take the appropriate action.

Practice Problems

  1. What is Machine Learning?
  2. What is the difference between Machine Learning and Artificial Intelligence?
  3. What are the basic steps in machine learning?

FAQs

What is Machine Learning?

Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.

What are the various steps in machine learning?

The seven basic steps in machine learning are Data Collection, Data Preparation, Choosing a Model, Training a Model, Evaluating a Model, Hyperparameter Tuning, and Prediction.

What is data collection and preparation?

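Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. Data preparation then transforms this raw data so that it can be run through machine learning algorithms.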

Are data collection and data processing the same?

No. Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built so the data collected (and later used as information) is of the highest possible quality.

Conclusion

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
