When working on Machine Learning projects, it is easy to overlook two of the most important ingredients: mathematics and data. Both matter because the results of an ML model depend heavily on the data you provide.
In one of our articles, we discussed how CSV data is created and used in ML projects. Before using data, however, it is important to understand it. You come to understand data in two ways: statistics and visualization.
Because Python is so widely used for developing ML projects, this article looks at descriptive statistics in Python.
Basic Descriptive Statistics Terms
Descriptive Statistics is about describing and summarizing data. It uses two main approaches:
- The quantitative approach describes and summarizes data numerically.
- The visual approach illustrates data with charts, plots, histograms, and other graphs.
Types of Variables in Statistics
You can apply descriptive statistics to one or many datasets or variables. These variables are classified as
- Univariate: When you describe and summarize a single variable, you’re performing univariate analysis. This is the simplest form of analysis because the data involves only one quantity that changes. It does not deal with causes or relationships; its main purpose is to describe the data and find patterns within it. An example of univariate data is height: suppose the heights of seven students in a class are recorded. There is only one variable, height, and no cause or relationship is involved. You can describe patterns in this type of data using measures of central tendency (mean, median and mode), measures of dispersion or spread (range, minimum, maximum, quartiles, variance and standard deviation), and frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.
- Bivariate: When you search for statistical relationships between a pair of variables, you’re doing bivariate analysis. This type of analysis deals with causes and relationships, and its goal is to find out how the two variables are related. An example of bivariate data is temperature and ice cream sales in the summer season. Bivariate data analysis therefore involves comparisons, relationships, causes and explanations. The variables are often plotted on the X and Y axes of a graph to make the data easier to understand; one of them is independent while the other is dependent (here, temperature is the independent variable and ice cream sales is the dependent variable).
- Multivariate: When you analyze more than two variables together, you’re performing multivariate analysis. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website: the click rates could be measured for both men and women, and the relationships between the variables could then be examined. It is similar to bivariate analysis but involves more than two variables. The appropriate technique depends on the goals of the analysis. Common techniques include regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
Types of Measures
Following are the basic types of measures evaluated in statistics:
- Central Tendency tells you about the centres of the data. Useful measures include the mean, median and mode.
- Variability tells you about the spread of the data. Useful measures include variance and standard deviation.
- Correlation or joint variability tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and correlation coefficient.
Population and Samples
In statistics, the population is the set of all elements or items that you’re interested in. Populations are often vast, which makes it impractical to collect and analyze data on every element. That’s why statisticians usually try to draw conclusions about a population by choosing and examining a representative subset of that population.
This subset of a population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory degree. That way, you’ll be able to use the sample to draw conclusions about the population.
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:
- Natural variation in data
- Change in the behaviour of the observed system
- Errors in data collection
Data collection errors are a particularly prominent cause of outliers. For example, the limitations of measurement instruments or procedures can mean that the correct data is simply not obtainable. Other errors can be caused by miscalculations, data contamination, human error, etc.
Descriptive Statistics in Python
In this era of Big Data and Artificial Intelligence, Data Science and Machine Learning have become essential in many fields of science and technology. A necessary aspect of working with data is the ability to describe, summarize, and represent data visually. Python statistics libraries are comprehensive, popular, and widely used tools that will assist you in working with data.
These statistics libraries help you to
- describe and summarize your datasets
- calculate descriptive statistics
- visualize your datasets
Python Statistics Libraries
There are many Python statistics libraries available for you to work with, but in this article, we’ll consider some of the most popular and widely used ones:
- Statistics: Python’s statistics is a built-in library for descriptive statistics. You can use it if your datasets are not too large or if you can’t rely on importing other libraries. The built-in Python statistics library has a relatively small number of the most important statistics functions. The official documentation is a valuable resource to find the details.
- NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis, which are described in the official NumPy documentation.
- SciPy is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.
- Pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
- Matplotlib is a third-party library for data visualization. It works well in combination with NumPy, SciPy and Pandas.
Calculating Basic Statistical Concepts in Python
The first step in using the above-mentioned libraries in Python is importing these libraries into your project.
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Creating Data in Python
Data can be created in Python in several ways. You can create dummy data frames using the pandas and NumPy packages. Data is often prepared in MS Excel and imported into Python later, but for experimenting it is usually quicker to prepare sample data directly in Python and use it for data manipulation.
Let’s create some data to work with. You’ll start with Python lists that contain some arbitrary numeric data:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
Here we’ve created two datasets: one containing only numbers and another that includes a nan (not-a-number) value. It’s important to understand the behavior of the Python statistics routines when they come across a nan. In data science, missing values are common, and you’ll often replace them with nan.
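Because nan propagates through arithmetic and never compares equal to itself, it helps to know how to detect it before computing any statistics. The following is a minimal sketch that uses only the standard library and the two lists defined above; the names has_nan, total and cleaned are illustrative and not part of the original listing.

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# nan is never equal to anything, including itself,
# so use math.isnan() rather than == to detect it.
has_nan = any(math.isnan(item) for item in x_with_nan)

# Any arithmetic that involves nan yields nan, so a plain sum is "poisoned".
total = sum(x_with_nan)

# Filtering nan out by hand recovers the clean data.
cleaned = [item for item in x_with_nan if not math.isnan(item)]

print(has_nan)            # True
print(math.isnan(total))  # True
print(cleaned == x)       # True
```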
Measures of Central Tendency
The measures of central tendency show the central or middle values of a dataset. Several of them can be calculated with Python:
- Mean
- Weighted Mean
- Geometric Mean
- Harmonic Mean
- Median
- Mode
Mean: The sample mean, also called the sample arithmetic mean or simply the average, is the arithmetic average of all the items in a dataset. The mean of a dataset is mathematically expressed as $\sum_i x_i / n$, where i = 1, 2, 3, …, n and n is the number of items. You can calculate the mean with pure Python using sum() and len(), without importing any of the statistics libraries:
mean_ = sum(x) / len(x)
You can also compute the mean more cleanly with the built-in Python statistics functions:
mean_ = statistics.mean(x)
mean_ = statistics.fmean(x)
statistics.fmean() always returns a floating-point number.
You can also use NumPy to find the mean. np.mean() returns nan if the data contains a nan, while np.nanmean() ignores nan values:
mean_ = np.mean(x)
np.nanmean(x_with_nan)
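To see how each library treats missing values, you can compute the mean of x_with_nan with all of them side by side. This is a small sketch rather than a complete comparison; the variable names are illustrative, and the Pandas behavior relies on the default skipna=True of Series.mean().

```python
import math
import statistics

import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

mean_stats = statistics.mean(x_with_nan)   # nan propagates into the result
mean_np = np.mean(x_with_nan)              # nan propagates into the result
mean_nan_np = np.nanmean(x_with_nan)       # nan values are ignored
mean_pd = pd.Series(x_with_nan).mean()     # nan values are skipped by default

print(mean_stats, mean_np, mean_nan_np, mean_pd)
```

The first two results are nan, while the last two are the mean of the five remaining numbers.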
Weighted Mean: The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.
Let $w_i$ be the weight for each data point $x_i$ of the dataset, where i = 1, 2, 3, …, n and n is the number of items. You multiply each data point by the corresponding weight, sum all the products, and divide the obtained sum by the sum of the weights: $\sum_i w_i x_i / \sum_i w_i$. You can calculate the weighted mean with pure Python using sum() together with either range() or zip(), without importing any of the statistics libraries:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
However, if you have large datasets, then NumPy is likely to provide a better solution. You can use np.average() to get the weighted mean of NumPy arrays or Pandas series.
y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean = np.average(y, weights=w)
wmean = np.average(z, weights=w)
Another solution is to use the element-wise product w * y with np.sum() or .sum():
(w * y).sum() / w.sum()
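As a quick sanity check, the approaches above should all give the same value for the example data. The sketch below compares them and also shows that, with equal weights, the weighted mean reduces to the ordinary mean; the names w_arr and wmean_equal are illustrative.

```python
import math

import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

# Pure Python: sum of products divided by the sum of weights.
wmean_pure = sum(x_ * w_ for x_, w_ in zip(x, w)) / sum(w)

# NumPy: np.average() and the element-wise product.
y, w_arr = np.array(x), np.array(w)
wmean_avg = np.average(y, weights=w_arr)
wmean_prod = (w_arr * y).sum() / w_arr.sum()

# With equal weights the weighted mean equals the arithmetic mean.
wmean_equal = np.average(y, weights=np.ones_like(y))

print(math.isclose(wmean_pure, wmean_avg))       # True
print(math.isclose(wmean_equal, np.mean(y)))     # True
```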
Harmonic Mean: The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset: $n / \sum_i (1 / x_i)$, where i = 1, 2, 3, …, n and n is the number of items in the dataset. You can calculate it with pure Python:
hmean = len(x) / sum(1 / item for item in x)
You can also calculate this measure with statistics.harmonic_mean():
hmean = statistics.harmonic_mean(x)
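The harmonic mean only makes sense for non-negative data, and statistics.harmonic_mean() enforces this. The following sketch compares the two approaches and illustrates the edge cases; the small lists containing a zero and a negative number are made-up inputs for illustration.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

hmean_pure = len(x) / sum(1 / item for item in x)
hmean_stats = statistics.harmonic_mean(x)

# If any value is zero, harmonic_mean() returns 0.
hmean_zero = statistics.harmonic_mean([1, 0, 2])

# Negative values are rejected with a StatisticsError.
try:
    statistics.harmonic_mean([1, 2, -2])
    negative_rejected = False
except statistics.StatisticsError:
    negative_rejected = True

print(math.isclose(hmean_pure, hmean_stats))  # True
print(hmean_zero)                             # 0
print(negative_rejected)                      # True
```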
Geometric Mean: The geometric mean is the n-th root of the product of all n elements in a dataset: $\sqrt[n]{\prod_i x_i}$, where i = 1, 2, 3, …, n.
You can implement the geometric mean in pure Python as given below:
gmean = 1
for item in x:
    gmean *= item
gmean **= 1 / len(x)
The better way of finding the geometric mean is by using the function statistics.geometric_mean(), which converts all the values to floating-point numbers and returns their geometric mean:
gmean = statistics.geometric_mean(x)
You can also get the geometric mean with scipy.stats.gmean():
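A minimal sketch of that call, assuming y is a NumPy array built from the list x used throughout this section:

```python
import math
import statistics

import numpy as np
import scipy.stats

x = [8.0, 1, 2.5, 4, 28.0]
y = np.array(x)

gmean_scipy = scipy.stats.gmean(y)
gmean_stats = statistics.geometric_mean(x)

# Both routines should agree up to floating-point rounding.
print(math.isclose(gmean_scipy, gmean_stats))  # True
```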
Median: The sample median is the middlemost element of a sorted dataset (either in ascending or descending order). If the number of elements n of the dataset is odd, then the median is the value at the middle position: 0.5(n + 1). If n is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5n and 0.5n + 1.
You can find the median using pure Python (without using any statistics libraries).
n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5 * (n - 1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index - 1] + x_ord[index])
You can get the median with statistics.median():
median_ = statistics.median(x)
You can also get the median with np.median():
median_ = np.median(y)
Pandas Series objects have the method .median() that ignores nan values by default:
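The sketch below contrasts this with NumPy, where np.median() returns nan when the data contains a nan and np.nanmedian() ignores it. It reuses x_with_nan from earlier; the other names are illustrative.

```python
import math

import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

median_np = np.median(x_with_nan)            # nan, because nan propagates
median_nan_np = np.nanmedian(x_with_nan)     # nan values are ignored
median_pd = pd.Series(x_with_nan).median()   # nan values are skipped by default

print(median_np, median_nan_np, median_pd)
```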
Mode: The sample mode is the value in the dataset that occurs most frequently. If there isn’t a single such value, then the set is multimodal since it has multiple modal values.
The mode can be evaluated using pure Python as shown below:
u = [2, 3, 2, 8, 12]
# max() picks the (count, item) pair with the highest count; [1] extracts the item
mode_ = max((u.count(item), item) for item in set(u))[1]
You can obtain the mode with statistics.mode() and statistics.multimode():
mode_ = statistics.mode(u)
mode_ = statistics.multimode(u)
You can also get the mode with scipy.stats.mode():
u = np.array(u)
mode_ = scipy.stats.mode(u)
Pandas Series objects have the method .mode() that handles multimodal values well and ignores nan values by default:
u = pd.Series(u)
u.mode()
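The difference between these routines shows up when the data has more than one mode. The following sketch uses a made-up bimodal list v (not from the examples above) and assumes Python 3.8 or later, where statistics.mode() returns the first mode it encounters instead of raising an error.

```python
import statistics

import pandas as pd

# Hypothetical bimodal data: 12 and 15 both occur three times.
v = [12, 15, 12, 15, 21, 15, 12]

first_mode = statistics.mode(v)          # the first mode encountered
all_modes = statistics.multimode(v)      # every mode, in order of first appearance
pd_modes = pd.Series(v).mode().tolist()  # every mode, sorted

print(first_mode)  # 12
print(all_modes)   # [12, 15]
print(pd_modes)    # [12, 15]
```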
Measures of Variability
The measures of central tendency aren’t sufficient to describe data. You’ll also need the measures of variability that quantify the spread of data points. Commonly used measures of variability are:
- Variance
- Standard deviation
- Skewness
- Percentiles and quartiles
- Range
Variance: The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean. You can express the sample variance of the dataset with n elements mathematically as $s^2 = \sum_i (x_i - \operatorname{mean}(x))^2 / (n - 1)$, where i = 1, 2, 3, …, n and mean(x) is the sample mean.
You can use pure Python to find the sample variance as shown below:
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
However, the shorter and more elegant way of finding the variance is using the statistics library:
var_ = statistics.variance(x)
You can also calculate the sample variance with NumPy. In this case you use the function np.var() or the corresponding method .var():
var_ = np.var(y, ddof=1)
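It’s very important to specify the parameter ddof=1. That sets the delta degrees of freedom to 1, so the sum of squared deviations is divided by n - 1 rather than n. Without it, np.var() uses ddof=0 and returns the population variance instead of the sample variance. pd.Series objects have the method .var() that skips nan values by default:
var_ = z.var(ddof=1)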
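The effect of ddof is easiest to see by computing both variants on the same data. This sketch, with illustrative variable names, compares NumPy with the statistics functions variance() and pvariance() and with the Pandas default.

```python
import math
import statistics

import numpy as np
import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
y, z = np.array(x), pd.Series(x)

var_sample = np.var(y, ddof=1)        # divides by n - 1 (sample variance)
var_population = np.var(y)            # default ddof=0 divides by n
var_stats = statistics.variance(x)    # sample variance
pvar_stats = statistics.pvariance(x)  # population variance
var_pd = z.var()                      # Pandas uses ddof=1 by default

print(math.isclose(var_sample, var_stats))       # True
print(math.isclose(var_population, pvar_stats))  # True
print(math.isclose(var_pd, var_sample))          # True
```

For a dataset with n items, the population variance is always the sample variance multiplied by (n - 1) / n.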
Standard Deviation: The sample standard deviation is another measure of data spread. It’s connected to the sample variance: the standard deviation, s, is the positive square root of the sample variance. The standard deviation is often more convenient than the variance because it has the same unit as the data points.
Once you get the variance, you can calculate the standard deviation with pure Python as:
std_ = var_ ** 0.5
You can also use statistics.stdev():
std_ = statistics.stdev(x)
You can get the standard deviation with NumPy in almost the same way, using the function np.std() or the corresponding method .std(). As with the variance, pass ddof=1 to get the sample standard deviation.
pd.Series objects also have the method .std() that skips nan by default:
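A short sketch of these calls, assuming the lists x and x_with_nan defined earlier. It also uses np.nanstd(), the nan-ignoring counterpart of np.std(); the result names are illustrative.

```python
import math

import numpy as np
import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
y = np.array(x)

std_func = np.std(y, ddof=1)                # function form
std_method = y.std(ddof=1)                  # method form
std_nan = np.nanstd(x_with_nan, ddof=1)     # ignores nan values
std_pd = pd.Series(x_with_nan).std()        # skips nan, ddof=1 by default

print(math.isclose(std_func, std_method))   # True
print(math.isclose(std_nan, std_pd))        # True
```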
Skewness: The sample skewness measures the asymmetry of a data sample. There are several mathematical definitions of skewness. One common expression to calculate the skewness of the dataset with n elements is $\frac{n^2}{(n - 1)(n - 2)} \cdot \frac{\sum_i (x_i - \operatorname{mean}(x))^3}{n s^3}$.
Skewness is not restricted to a fixed range such as -1 to 1; strongly asymmetric data can produce larger values. Negative skewness values indicate a dominant tail on the left side, and positive skewness values correspond to a longer tail on the right side. If the skewness is close to 0, the dataset is roughly symmetrical.
Once you’ve calculated the size of your dataset n, the sample mean mean_, and the standard deviation std_, you can get the sample skewness with pure Python as:
x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x)
         * n / ((n - 1) * (n - 2) * std_**3))
You can also calculate the sample skewness with scipy.stats.skew():
y = np.array(x)
scipy.stats.skew(y, bias=False)
Pandas Series objects have the method .skew() that also returns the skewness of a dataset:
z = pd.Series(x)
z.skew()
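To connect the sign of the skewness with the shape of the data, the sketch below computes .skew() for three small made-up samples: one with a long right tail, its mirror image, and a symmetric one. The data and names are purely illustrative.

```python
import math

import pandas as pd

# Hypothetical samples chosen to show the sign of the skewness.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 10])       # one large value on the right
left_tailed = pd.Series([-10, -3, -3, -3, -2, -2, -1])  # mirror image of the above
symmetric = pd.Series([1, 2, 3, 4, 5])

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()
skew_sym = symmetric.skew()

print(skew_right > 0)        # True: longer tail on the right
print(skew_left < 0)         # True: longer tail on the left
print(abs(skew_sym) < 1e-9)  # True: symmetric data
```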
Percentiles and Quartiles: The sample p percentile is the element in the dataset such that p% of the elements in the dataset are less than or equal to that value. Also, (100 – p)% of the elements are greater than or equal to that value. If there are two such elements in the dataset, then the sample p percentile is their arithmetic mean.
Quartiles are a closely related concept. Each dataset has three quartiles, which divide the complete dataset into four parts.
- The first quartile is the sample 25th percentile. It divides roughly 25% of the smallest items from the rest of the dataset.
- The second quartile is the sample 50th percentile or the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles. It divides the whole dataset into two equal halves.
- The third quartile is the sample 75th percentile. It divides roughly 25% of the largest items from the rest of the dataset.
Each part has approximately the same number of items. If you want to divide your data into several intervals, you can use statistics.quantiles():
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=2)
statistics.quantiles(x, n=4, method='inclusive')
[0.1, 8.0, 21.0]
In this example, 8.0 is the median of x, while 0.1 and 21.0 are the sample 25th and 75th percentiles, respectively. The parameter n defines the number of equal-probability intervals the data is divided into, and the method determines how the cut points are calculated. You can also use np.percentile() to determine any sample percentile in your dataset. For example, this is how you can find the 5th and 95th percentiles:
y = np.array(x)
np.percentile(y, 5)
np.percentile(y, 95)
NumPy also offers you very similar functionality in quantile() and nanquantile(). If you use them, then you’ll need to provide the quantile values as the numbers between 0 and 1 instead of percentiles:
np.quantile(y, 0.05)
np.quantile(y, 0.95)
pd.Series objects have the method .quantile():
z = pd.Series(y)
z.quantile(0.05)
z.quantile(0.95)
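As a consistency check, the three routines should return the same 5th percentile for this data, and NumPy can compute several percentiles in one call. This sketch assumes the default (linear) interpolation of each library; the result names are illustrative.

```python
import math

import numpy as np
import pandas as pd

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
y, z = np.array(x), pd.Series(x)

p5_percentile = np.percentile(y, 5)   # percentile given on a 0-100 scale
p5_quantile = np.quantile(y, 0.05)    # quantile given on a 0-1 scale
p5_pandas = z.quantile(0.05)

# Passing a sequence returns all requested percentiles at once.
quartiles_np = np.percentile(y, [25, 50, 75])

print(math.isclose(p5_percentile, p5_quantile))  # True
print(math.isclose(p5_percentile, p5_pandas))    # True
print(quartiles_np)
```

For this dataset the three quartiles match the output of statistics.quantiles() with method='inclusive'.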
Range: The range of data is the difference between the maximum and minimum element in the dataset. You can get it with the function np.ptp():
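A minimal sketch of the range calculation, using the nine-element list x from the quantile example and comparing np.ptp() with the explicit difference between the maximum and the minimum:

```python
import math

import numpy as np

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
y = np.array(x)

range_ptp = np.ptp(y)              # "peak to peak": maximum minus minimum
range_numpy = y.max() - y.min()    # the same with NumPy methods
range_pure = max(x) - min(x)       # the same with built-in functions

print(range_ptp)  # 46.0
```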
Summary of Descriptive Statistics: SciPy and Pandas offer useful routines to quickly get descriptive statistics with a single function or method call. You can use scipy.stats.describe() like this:
result = scipy.stats.describe(y, ddof=1, bias=False)
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)
Pandas has similar, if not better, functionality. Series objects have the method .describe():
result = z.describe()
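Both results give access to the individual statistics. The sketch below shows one way to read them: attribute access for the SciPy result and label-based access for the Pandas result. The variable names result_scipy and result_pd are illustrative.

```python
import math

import numpy as np
import pandas as pd
import scipy.stats

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
y, z = np.array(x), pd.Series(x)

result_scipy = scipy.stats.describe(y, ddof=1, bias=False)
result_pd = z.describe()

# SciPy returns a named result: read the fields as attributes.
print(result_scipy.nobs, result_scipy.minmax)
print(result_scipy.mean, result_scipy.variance)

# Pandas returns a Series: read the statistics by label.
print(result_pd['count'], result_pd['mean'], result_pd['std'])
print(result_pd['25%'], result_pd['50%'], result_pd['75%'])
```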
Measures of Correlation Between Pairs of Data
You’ll often need to examine the relationship between the corresponding elements of two variables in a dataset. For example, there are two variables, x, and y, with an equal number of elements, n. Let x1 from x correspond to y1 from y, x2 from x to y2 from y, and so on. You can then say that there are n pairs of corresponding elements: (x1, y1), (x2, y2), and so on.
The measure of the correlation between pairs of data can be of the following type:
- Positive correlation exists when larger values of x correspond to larger values of y and vice versa.
- Negative correlation exists when larger values of x correspond to smaller values of y and vice versa.
- Zero correlation exists if there is no such apparent relationship.
The two statistics that measure the correlation between datasets are covariance and the correlation coefficient.
Covariance: The sample covariance is a measure that quantifies the strength and direction of a relationship between a pair of variables:
- If the correlation is positive, then the covariance is positive, as well. A stronger relationship corresponds to a higher value of the covariance.
- If the correlation is negative, then the covariance is negative, as well. A stronger relationship corresponds to a lower (or higher absolute) value of the covariance.
- If the correlation is weak, then the covariance is close to zero.
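Note that, unlike the correlation coefficient, the covariance is not confined to the range from -1 to 1. Its magnitude depends on the units and scale of the variables, so its sign is easier to interpret than its size. Thresholds such as 0.5 to 1 for a strong positive relationship, -0.5 to -1 for a strong negative relationship, and values closer to 0 for a weak relationship apply to the correlation coefficient described later in this section, not to the raw covariance.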
The covariance of the variables x and y is mathematically defined as $s_{xy} = \sum_i (x_i - \operatorname{mean}(x))(y_i - \operatorname{mean}(y)) / (n - 1)$, where i = 1, 2, 3, …, n, mean(x) is the sample mean of x and mean(y) is the sample mean of y.
This is how you can calculate the covariance in pure Python:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
          / (n - 1))
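NumPy has the function cov() that returns the covariance matrix. Here x_ and y_ are NumPy arrays built from x and y:
x_, y_ = np.array(x), np.array(y)
cov_matrix = np.cov(x_, y_)
You can obtain the same value of the covariance with np.cov() as with pure Python: the off-diagonal elements of the matrix hold the covariance of x and y, and the diagonal elements hold their variances. Pandas Series have the method .cov() that you can use to calculate the covariance:
x__, y__ = pd.Series(x_), pd.Series(y_)
cov_xy = x__.cov(y__)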
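The following self-contained sketch puts the three approaches together. The temperature and ice cream sales figures are made-up values chosen only to illustrate a positive relationship; they are not data from this article. The last step also shows that the covariance depends on the scale of the variables.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical paired observations (illustrative values only).
temperature = [20.0, 22.0, 25.0, 27.0, 30.0, 33.0]
sales = [110.0, 125.0, 150.0, 160.0, 190.0, 210.0]

n = len(temperature)
mean_t, mean_s = sum(temperature) / n, sum(sales) / n

# Pure Python: sum of products of deviations divided by n - 1.
cov_pure = sum((temperature[k] - mean_t) * (sales[k] - mean_s)
               for k in range(n)) / (n - 1)

# NumPy: the off-diagonal element of the covariance matrix.
cov_np = np.cov(temperature, sales)[0, 1]

# Pandas: Series.cov() computes the sample covariance.
cov_pd = pd.Series(temperature).cov(pd.Series(sales))

# Rescaling one variable by 100 rescales the covariance by 100.
cov_scaled = np.cov(temperature, np.array(sales) * 100)[0, 1]

print(math.isclose(cov_pure, cov_np), math.isclose(cov_np, cov_pd))  # True True
print(math.isclose(cov_scaled, 100 * cov_np))                        # True
```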
Correlation Coefficient: The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by the symbol r. The coefficient is another measure of the correlation between data. Following important points are to be kept in mind regarding the correlation coefficient:
- The value r > 0 indicates positive correlation.
- The value r < 0 indicates negative correlation.
- The value r = 1 is the maximum possible value of r. It corresponds to a perfect positive linear relationship between the variables.
- The value r = -1 is the minimum possible value of r. It corresponds to a perfect negative linear relationship between the variables.
- The value r = 0, or when r is around zero, means that the correlation between variables is weak.
The mathematical formula for the correlation coefficient is $r = s_{xy} / (s_x s_y)$, where $s_x$ and $s_y$ are the standard deviations of x and y respectively and $s_{xy}$ is the covariance between x and y. Given the means and the covariance, you can calculate r with pure Python:
var_x = sum((item - mean_x)**2 for item in x) / (n - 1)
var_y = sum((item - mean_y)**2 for item in y) / (n - 1)
std_x, std_y = var_x ** 0.5, var_y ** 0.5
r = cov_xy / (std_x * std_y)
scipy.stats has the routine pearsonr() that calculates the correlation coefficient and the p-value:
r, p = scipy.stats.pearsonr(x_, y_)
pearsonr() returns a tuple with two numbers. The first one is r and the second is the p-value.
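To tie these pieces together, the sketch below computes r for a small made-up dataset with scipy.stats.pearsonr(), np.corrcoef() and the Pandas method .corr(). The temperature and sales values are hypothetical and serve only as an illustration. Unlike the covariance, r does not change when one of the variables is rescaled.

```python
import math

import numpy as np
import pandas as pd
import scipy.stats

# Hypothetical paired observations (illustrative values only).
temperature = [20.0, 22.0, 25.0, 27.0, 30.0, 33.0]
sales = [110.0, 125.0, 150.0, 160.0, 190.0, 210.0]

# SciPy: correlation coefficient and p-value.
r_scipy, p_value = scipy.stats.pearsonr(temperature, sales)

# NumPy: the off-diagonal element of the correlation matrix.
r_np = np.corrcoef(temperature, sales)[0, 1]

# Pandas: Series.corr() uses the Pearson coefficient by default.
r_pd = pd.Series(temperature).corr(pd.Series(sales))

# Rescaling a variable leaves the correlation coefficient unchanged.
r_scaled = np.corrcoef(temperature, np.array(sales) * 100)[0, 1]

print(math.isclose(r_scipy, r_np), math.isclose(r_np, r_pd))  # True True
print(math.isclose(r_scaled, r_np))                           # True
```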