Machine learning is a sub-field of artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data. Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task. Machine learning has been at the forefront of many technological advancements in recent years such as self-driving cars, computer vision, and speech recognition systems.
Basic Machine Learning Terms
Here are 11 machine learning terms that kids and every beginner in machine learning should know:
1. Clustering
Clustering is a type of unsupervised learning method, that is, a method that draws inferences from datasets consisting of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to each other than to points in other groups. In short, it collects objects into groups on the basis of their similarity and dissimilarity.
Let’s understand clustering with a real-world example: a shopping mall. When you visit a mall, you see that items with similar uses are grouped together: t-shirts are in one section and trousers in another. Similarly, in the produce section, apples, bananas, mangoes, etc. are kept in separate groups so that you can find them easily. Clustering works in the same way.
Clustering is used in a wide variety of tasks. For example, Amazon uses it in its recommendation system to suggest products based on past searches, and Netflix uses it to recommend movies and web series based on watch history.
Some of the most common uses of this technique are:
- Market Segmentation
- Statistical Data Analysis
- Social Network Analysis
- Image Segmentation
- Anomaly Detection
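To make the grouping idea concrete, here is a minimal sketch of k-means in Python. k-means is one common clustering algorithm, picked here purely for illustration, since the text above does not name a specific algorithm. The six points are made up, and starting the centroids at the first k points is a simplifying assumption (real implementations usually pick starting points more carefully).

```python
def squared_distance(p, q):
    """Squared straight-line distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters with plain k-means."""
    # Assumption for this sketch: start from the first k points.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters


# Made-up data: three points near (1, 1) and three near (8, 8).
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (8.0, 8.2), (8.3, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No labels are given to the algorithm: it only sees the coordinates and still ends up with one group per "section", much like the mall example.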
2. Regression
Data scientists use many different kinds of machine learning algorithms to discover patterns in big data that lead to actionable insights. At a high level, these different algorithms can be classified into two groups based on the way they “learn” about data to make predictions: supervised and unsupervised learning. Regression analysis is one of the techniques used in supervised machine learning algorithms.
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x). The goal of a regression model is to build a mathematical equation that defines y as a function of the x variable(s). Next, this equation can be used to predict the outcome (y) on the basis of the new values of the predictor variable(s) x.
Several types of regression are used in data science and machine learning. Each suits different scenarios, but at their core all regression methods analyze the effect of one or more independent variables on a dependent variable. (Logistic regression, despite its name, is normally used for classification: it predicts the probability of a class rather than a continuous value.) Some of the most common regression models are:
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Support Vector Regression
- Decision Tree Regression
- Random Forest Regression
- Ridge Regression
- Lasso Regression
Let’s consider an example to understand regression. Suppose a marketing company, A, runs various advertisements every year and earns sales from them. The table below shows the amount spent on advertising and the corresponding sales over the past 5 years:
| Advertisement (in hundred dollars) | Sales (in hundred dollars) |
In the 6th year, the company has allotted $20,000 for advertising (x = 200 in the table's units of hundred dollars) and wants to know the sales this is likely to generate.
In such a scenario, we generally use a linear regression model of the form y = ax + b, where y (sales) is the dependent variable (the outcome), x (advertisement) is the independent variable (the predictor), and a and b are the regression coefficients: the slope and the intercept.
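Here is a small sketch of how a and b in y = ax + b can be estimated with ordinary least squares. The five data points are invented for illustration, because the actual figures from the company's table are not reproduced in this text; only the $20,000 budget (x = 200 in hundred dollars) comes from the example.

```python
# Hypothetical figures, both in hundreds of dollars (NOT the company's real data).
advertisement = [100, 120, 140, 160, 180]
sales = [1100, 1250, 1500, 1600, 1850]


def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: how x and y vary together, divided by how much x varies on its own.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: chosen so the line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(advertisement, sales)
predicted_sales = a * 200 + b  # 6th year: $20,000 = 200 hundred dollars
print(f"y = {a:.2f}x + {b:.2f}")
print(f"predicted sales: {predicted_sales:.0f} hundred dollars")
```

With real data you would plug in the five rows from the table instead; the fitting step stays the same.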
3. Database
The database is a necessary component of machine learning. To build a machine learning system, you need to either collect data from public sources or generate new data. All the datasets used for machine learning together form the database. Generally, data scientists divide the data into three categories:
- Train Dataset: The train dataset is used for training models. Through training, a machine learning model learns to recognize the important features of the data.
- Validate Dataset: The validate dataset is used for tuning a model's settings (hyperparameters) and for comparing models to pick the best one. It is kept separate from the train dataset and must not be used for training; otherwise overfitting may go unnoticed and the model may perform poorly on new data.
- Test Dataset: Once the model is finalized, the test dataset is used to measure the model’s performance on data it has not seen before.
A traditional ratio for these three datasets is 50 : 25 : 25. However, some models need little tuning, or the train dataset can serve for both training and validation, in which case a train : test ratio of 70 : 30 can be used.
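The split itself takes only a few lines. The sketch below shuffles the data and cuts it 50 : 25 : 25; the function name `split_dataset`, the fixed seed, and the 100 placeholder rows are assumptions made for this example.

```python
import random


def split_dataset(rows, train=0.50, validate=0.25, seed=42):
    """Shuffle the rows and cut them into train / validate / test parts."""
    shuffled = list(rows)
    # Shuffle first so that each part is a random sample of the whole.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validate = int(len(shuffled) * validate)
    train_set = shuffled[:n_train]
    validate_set = shuffled[n_train:n_train + n_validate]
    test_set = shuffled[n_train + n_validate:]  # whatever is left over
    return train_set, validate_set, test_set


rows = list(range(100))  # placeholder for 100 real examples
train_set, validate_set, test_set = split_dataset(rows)
print(len(train_set), len(validate_set), len(test_set))
```

For a 70 : 30 train : test split, you could call `split_dataset(rows, train=0.70, validate=0.0)` and ignore the empty validate part.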
4. Natural Language Processing
Natural language processing (NLP) is a very common application area of machine learning. It has made it possible for computers to read human language and incorporate it into all kinds of processes.
The most common applications of NLP include:
- Text Classification and Sorting: This deals with classifying texts into categories, or sorting a list of texts by relevance. For example, it can be used to filter out spam by deciding whether each email is spam or not; in business, it can also be used to identify and extract information related to your competitors.
- Sentiment Analysis: With sentiment analysis, a computer can detect sentiments such as anger, sadness, or delight by analyzing text. In other words, it can tell whether people were feeling happy, sad, or angry when they wrote the words or sentences. This is widely used on customer satisfaction surveys to analyze how customers feel about a product.
- Information Extraction: This pulls the key information out of a long text. A closely related task, text summarization, condenses a long paragraph into a short text, much like writing an abstract.
- Named-Entity Recognition: Suppose you have extracted a batch of messy profile data, such as addresses, phone numbers, and names, all mixed up with one another. Wouldn't it be nice to clean this data so that each item is identified and matched to its proper type? This is how named-entity recognition helps turn messy information into structured data.
- Speech Recognition: Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to convert human speech into written text.
- Natural Language Understanding and Generation: Natural language understanding (NLU) uses a computer to transform human expressions into a representation the computer can work with. Conversely, natural language generation (NLG) transforms the computer's representation into human expressions. This technology is commonly used when humans communicate with robots.
- Machine Translation: Machine translation automatically translates texts into another language (or to any specific language).
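As a taste of sentiment analysis from the list above, here is a deliberately tiny sketch. It uses a hand-written word list (a "lexicon") instead of a trained model, which is the simplest possible approach and far cruder than what real systems do; the word lists and example sentences are made up for illustration.

```python
# Tiny hand-made lexicons; real systems learn these signals from data.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "angry"}


def sentiment(text):
    """Label a text by counting positive and negative words."""
    # Lower-case each word and strip common punctuation from its ends.
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["I love this product, it is great!",
               "Terrible service, I am angry.",
               "The parcel arrived on Tuesday."]:
    print(sentiment(review), "<-", review)
```

A word list like this misses sarcasm, negation ("not good"), and unknown words, which is exactly why machine learning models are used for the job in practice.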
5. Overfitting and Underfitting
Overfitting and underfitting are two main problems that occur in machine learning and degrade the performance of the machine learning models.
The main goal of every machine learning model is to generalize well. Generalization is the ability of a model to produce suitable output for inputs it has not seen before. It means that, after training on a dataset, the model can produce reliable and accurate results on new data. Overfitting and underfitting are therefore the two conditions to check when assessing whether a model is performing and generalizing well.
Before moving on to these two terms, let’s first understand two related terms: bias and variance.
- Bias: The error introduced by the simplifying assumptions a model makes to make the target function easier to learn.
- Variance: How much a model's predictions change when it is trained on different data. If a model achieves a very low error on its training data but a high error once the data changes, it has high variance.
Underfitting: A machine learning model is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of the model: it simply means that the model does not fit the data well enough.
It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases, the rules of the model are too simple and rigid to capture the pattern, so the model will probably make a lot of wrong predictions. Underfitting can be reduced by using a more expressive model or more informative features; more data helps when the dataset is too small to reveal the trend.
Underfitting – High bias and low variance.
Overfitting: A machine learning model is said to be overfitted when it fits its training data too closely. When a model is trained for too long or is too flexible for the amount of data available, it starts learning from the noise and inaccurate entries in the data set. The model then fails to categorize new data correctly, because it has memorized too many details and too much noise.
Overfitting is most common with non-parametric and non-linear methods, because these types of algorithms have more freedom in building the model from the dataset and can therefore readily build unrealistic models. Ways to avoid overfitting include using a linear algorithm if the data is linear, or limiting parameters such as the maximal depth if we are using decision trees.
Overfitting – Low bias and high variance.
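The difference is easiest to see by comparing errors on training data and on new data. The sketch below uses made-up points that roughly follow y = 2x and three hand-picked models: one that always predicts the average (too simple), one that memorises the training points (too flexible), and a straight line. These three models are illustrative choices for this example, not the only way to show the effect.

```python
# Made-up data for illustration: roughly y = 2x, with noise in the training set.
train = [(0, 0.5), (1, 1.5), (2, 4.8), (3, 5.4), (4, 8.7), (5, 9.6)]
test = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]


def mse(model, data):
    """Mean squared error of a model (a function x -> y) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)


def underfit(x):
    # Too simple: ignores x and always predicts the average training y.
    return mean_y


def overfit(x):
    # Too flexible: memorises the training set and copies the y of the
    # nearest training x, noise included.
    return min(train, key=lambda point: abs(point[0] - x))[1]


# Reasonable fit: least-squares straight line through the training points.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x


def line(x):
    return slope * x + intercept


for name, model in [("underfit", underfit), ("overfit", overfit), ("line", line)]:
    print(f"{name:9s} train error {mse(model, train):7.3f}"
          f"   test error {mse(model, test):7.3f}")
```

The memorising model scores a perfect zero on the training data yet does worse than the straight line on the test data, while the averaging model does badly on both.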
6. Supervised Learning
Supervised learning is a family of machine learning models that teach themselves by example. This means that data for a supervised machine learning task needs to be labeled (assigned the right, ground-truth class). For example, if we would like to build a machine learning model for recognizing if a given text is about marketing, we need to provide the model with a set of labeled examples (text + information if it is about marketing or not). Given a new, unseen example, the model predicts its target – e.g., for the stated example, a label (e.g., 1 if a text is about marketing and 0 otherwise).
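The marketing example can be sketched in a few lines. The classifier below is deliberately crude: it just counts how often each word appeared under each label and lets the words of a new text "vote". The four labeled sentences and the function names `train` and `predict` are made up for this illustration; real text classifiers use far more data and better models.

```python
from collections import Counter

# Made-up labeled examples: 1 = about marketing, 0 = not about marketing.
examples = [
    ("our new ad campaign boosted brand awareness", 1),
    ("email marketing increases customer conversion", 1),
    ("the football match ended in a draw", 0),
    ("heavy rain is expected this weekend", 0),
]


def train(labeled_texts):
    """Count how often each word occurs under each label."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labeled_texts:
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the label whose training texts share the most words with the text."""
    words = text.lower().split()
    scores = {label: sum(seen[w] for w in words) for label, seen in counts.items()}
    return max(scores, key=scores.get)


model = train(examples)
print(predict(model, "a campaign to grow brand conversion"))
print(predict(model, "rain delayed the football match"))
```

The key point is that the labels do the teaching: without the 1s and 0s attached to the examples, the model would have nothing to learn the categories from.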
7. Unsupervised Learning
Unsupervised learning models teach themselves by observation. The data provided to this kind of algorithm is unlabeled (no ground-truth value is given to the algorithm). Unsupervised learning models are able to find structure and relationships among the inputs. One of the most important unsupervised learning techniques is “clustering”.
8. Reinforcement Learning
Reinforcement learning differs in its approach from supervised and unsupervised learning. In reinforcement learning, the algorithm plays a “game” in which it aims to maximize a reward. The algorithm tries different approaches, or “moves”, by trial and error and sees which one yields the highest reward.
The best-known use cases of reinforcement learning are teaching a computer to solve a Rubik’s Cube or play chess, but there is more to reinforcement learning than games. Recently, a growing number of reinforcement learning solutions have appeared in Real-Time Bidding, where the model is responsible for bidding on an ad slot and its reward is the client’s conversion rate.
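Trial and error can be shown with one of the simplest reinforcement learning settings, the "multi-armed bandit": the algorithm repeatedly chooses one of several options and only learns from the reward it gets back. The sketch below uses the epsilon-greedy strategy, which mostly picks the best option found so far but explores a random one 10% of the time. The three win probabilities, the step count, and the seed are made-up values for this example.

```python
import random


def epsilon_greedy(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which option pays off most often."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)    # how often each option was tried
    values = [0.0] * len(true_probs)  # running average reward per option
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random option.
            arm = rng.randrange(len(true_probs))
        else:
            # Exploit: pick the option that looks best so far.
            arm = max(range(len(true_probs)), key=lambda i: values[i])
        # The environment pays 1 with a probability hidden from the learner.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Update the running average for the chosen option.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Hypothetical options that pay off 20%, 50%, and 80% of the time.
counts, values = epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda i: values[i])
print("tries per option:", counts)
print("best option found:", best_arm)
```

Nobody tells the algorithm which option is best; it works that out from rewards alone, which is the core difference from supervised learning.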
9. Neural Network
Neural networks are a very wide family of machine learning models. The main idea behind them is to mimic the behaviour of the human brain when processing data. Loosely modelled on the networks of neurons in the brain, artificial neural networks are composed of layers. Each layer is a set of neurons, each of which is responsible for detecting something different. A neural network processes data sequentially, which means that only the first layer is directly connected to the input. Each subsequent layer detects features based on the output of the previous layer, which enables the model to learn more and more complex patterns as the number of layers increases. When a network has many layers, the model is often called a Deep Learning model.
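The layer-by-layer flow can be sketched without any library. The tiny network below has one hidden layer of two neurons and one output neuron; the weights are arbitrary numbers chosen for illustration (a real network would learn them during training), and ReLU is just one common choice of activation function.

```python
def relu(values):
    """Activation: keep positive values, replace negative ones with 0."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]


def forward(x, layers):
    """Pass the input through the layers, each feeding the next."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]  # the last layer is left without activation
    return dense(x, weights, biases)


# Arbitrary, untrained weights chosen only to make the arithmetic easy to follow.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 inputs -> 2 neurons
    ([[1.0, 2.0]], [0.5]),                    # output layer: 2 inputs -> 1 neuron
]
output = forward([2.0, 1.0], layers)
print(output)  # [4.5]
```

Only the first layer ever sees the raw input `[2.0, 1.0]`; the output neuron works purely from what the hidden layer produced.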
10. Computer Vision
Computer vision is an artificial intelligence field focusing on analyzing and understanding image and video data. The problems we often see in computer vision include:
- Image Classification: Image classification is a computer vision task that teaches a computer to recognize what an image shows. For example, a model can be trained to recognize particular objects that appear in a specific place.
- Target Detection: Target detection teaches a model to find objects belonging to a set of predefined categories and to mark each one with a bounding rectangle. For example, target detection can be used to build face recognition systems. The model detects every predefined object and highlights it.
- Image Segmentation: Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
- Significance Test: Once sample data has been gathered through an observational study or experiment, statistical inference allows analysts to assess evidence in favour of some claim about the population from which the sample was drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance.
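To give a feel for image segmentation from the list above, here is a minimal sketch that splits a tiny grayscale image into two segments using a brightness threshold. Thresholding is the simplest segmentation technique and is used here only for illustration; the 3 x 4 "image" and the threshold of 128 are made-up values.

```python
def threshold_segment(image, threshold):
    """Label each pixel 1 (bright segment) or 0 (dark segment)."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


# A made-up grayscale image: 0 is black, 255 is white.
image = [
    [12, 15, 200, 210],
    [10, 180, 220, 14],
    [9, 11, 190, 13],
]
mask = threshold_segment(image, threshold=128)
for row in mask:
    print(row)
```

The result is a simpler representation of the picture: instead of twelve brightness values, each pixel now just says which of the two segments it belongs to.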
11. TensorFlow
TensorFlow is an open-source library developed by Google for machine learning projects. It was created by the Google Brain team and released in 2015 under the Apache 2.0 license. Today, it is one of the most widely used tools in the world of machine learning, particularly for building neural networks.
Although TensorFlow is used mainly in machine learning, it can also be used for other types of algorithms that require numerical computation using data flow graphs.
There are other alternatives to TensorFlow in the market such as PyTorch from Facebook and MXNet from Amazon.
Image Credit: Background vector created by starline – www.freepik.com