Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of domains. Data science is related to data mining, machine learning, and big data.
Data scientists use a blend of tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. Statistical methods and techniques play important roles in carrying out these tasks.
Basic Concepts of Statistics Used in Data Science
Statistics is the discipline of analyzing data. As such, it intersects heavily with data science and machine learning. The following are four basic concepts of statistics used in data science.
1. Descriptive Statistics
Descriptive statistics summarizes the data at hand through numbers such as the mean, median, mode, variance, and standard deviation, making the data easier to understand. It does not involve any generalization or inference beyond what is available: descriptive statistics are simply a representation of the data (sample) at hand and are not based on any theory of probability.
In business, descriptive statistics provide the analyst with a view of key metrics and measures (such as those mentioned above) within the business. Descriptive statistics include exploratory data analysis, unsupervised learning, clustering, and basic data summaries, and they are usually the starting point for any analysis. Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
Descriptive statistics are important because raw data on its own is hard to interpret, especially when there is a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way.
To understand the role of descriptive statistics, consider the following example. You have the marks obtained by 100,000 students in a particular examination, and you are interested in the overall performance of these students. Descriptive statistics let you summarize exactly that.
The mean gives the average score of the students; the median and quartiles help in finding a student's percentile score (i.e., where a particular student stands relative to the rest); and the standard deviation and variance show the spread of the data.
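These summaries can be computed directly with Python's standard library. The scores below are a small made-up sample for illustration, not the 100,000-student dataset itself:

```python
import statistics

# Hypothetical exam scores for a small sample of students (illustrative data)
scores = [55, 62, 70, 70, 74, 78, 81, 85, 90, 95]

mean = statistics.mean(scores)                  # average score
median = statistics.median(scores)              # middle score (50th percentile)
mode = statistics.mode(scores)                  # most frequent score
stdev = statistics.stdev(scores)                # sample standard deviation (spread)
quartiles = statistics.quantiles(scores, n=4)   # Q1, Q2, Q3 cut points

print(mean, median, mode, round(stdev, 2), quartiles)
```

Each of these numbers condenses the full list of scores into a single, interpretable figure, which is exactly the job of descriptive statistics.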
2. Inferential Statistics
In inferential statistics, we make an inference about the population from a sample. The main aim of inferential statistics is to draw conclusions from the sample and generalize them to the population. For example, suppose you want to find the average salary of a data analyst across a country. There are two options available to you:
- The first option is to collect the salary of every data analyst in the country and take the average.
- The second option is to take a sample of data-analyst salaries from the major IT cities of the country, take their average, and treat that as the figure for the whole country.
The first option is impractical: collecting salary data for every data analyst in a country is both time-consuming and costly. To overcome these issues, we take the second option, collecting a small sample of salaries and using its average as an estimate of the country's average. This is inferential statistics: making an inference about the population from a sample.
The most common methodologies in inferential statistics are hypothesis tests, confidence intervals, and regression analysis.
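A confidence interval for the salary example above can be sketched in a few lines. The salary figures here are hypothetical, and the normal approximation (z ≈ 1.96) is used instead of a t-distribution purely for simplicity:

```python
import math
import statistics

# Hypothetical sample of data-analyst salaries (in thousands) from a few cities
sample = [52, 61, 58, 70, 65, 55, 60, 68, 63, 59]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# Approximate 95% confidence interval using the normal critical value 1.96;
# for a sample this small, a t-distribution would be more appropriate
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"Estimated average salary: {mean:.1f}k (95% CI: {lower:.1f}k to {upper:.1f}k)")
```

The interval expresses the uncertainty in generalizing from the sample to the whole population, which is the core idea of inference.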
3. Predictive Analytics
Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset. Prediction is the process of trying to guess an outcome given a set of realizations of the outcome and some predictors. Machine learning methods such as regression, deep learning, boosting, random forests, and logistic regression are all prediction algorithms.
Predictive analytics uses historical data to predict future events. Typically, historical data is used to build a mathematical model that captures important trends. Then that predictive model is used on current data to predict what will happen next or to suggest actions to take for optimal outcomes.
Predictive analytics has received a lot of attention in recent years due to advances in supporting technology, particularly in the areas of big data and machine learning.
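As a minimal sketch of this build-then-predict workflow, the following fits a least-squares line to historical data and uses it to forecast the next period. The data and variable names are made up for illustration:

```python
# Fit a simple least-squares line to historical data, then predict a future value.
xs = [1, 2, 3, 4, 5]          # e.g., month number (hypothetical)
ys = [10, 12, 15, 19, 22]     # e.g., monthly sales (hypothetical)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a single predictor
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

predicted = intercept + slope * 6   # forecast for month 6
print(round(predicted, 2))
```

Real predictive-analytics pipelines use richer models (random forests, boosting, neural networks), but the pattern is the same: fit a model to historical data, then apply it to new data.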
4. Experimental Design
At the heart of every data science project lies the planning, design, and execution of experiments. Such experiments aim at understanding the data, potentially cleaning it, and performing the necessary analysis for knowledge discovery and decision-making. Without knowing the experimental design processes used in practice, researchers may not be able to discover what is really hidden in their data.
Experimental design is the act of controlling your experimental process to optimize the chance of arriving at sound conclusions. The most notable example is randomization: a treatment is randomly assigned across experimental units so that the treatment groups are as comparable as possible. Clinical trials are the classic example of randomization in practice.
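Randomized assignment can be sketched in a few lines; the unit names and group sizes here are hypothetical:

```python
import random

# Randomly assign experimental units (e.g., patients) to treatment or control.
random.seed(42)  # fixed seed so the assignment is reproducible in this sketch

units = [f"patient_{i}" for i in range(1, 21)]   # 20 hypothetical units
shuffled = random.sample(units, k=len(units))    # a random permutation
treatment, control = shuffled[:10], shuffled[10:]

print(len(treatment), len(control))
```

Because assignment is random, systematic differences between the two groups are due to chance alone, which is what makes the groups comparable.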
In random sampling, one draws a random sample from the population of interest so that the results generalize better to that population.
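A simple random sample, where every member of the population has an equal chance of selection, can be drawn like this (the population here is just a list of made-up record IDs):

```python
import random

# Draw a simple random sample (without replacement) from a hypothetical
# population of 100,000 records, e.g., the student marks discussed earlier.
random.seed(0)  # fixed seed for reproducibility in this sketch
population = list(range(1, 100_001))
sample = random.sample(population, k=500)

print(len(sample))
```

Sampling without replacement via `random.sample` guarantees no record is picked twice, which is the standard setup for a simple random sample.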