Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and to apply those insights across a broad range of domains. Data science is related to data mining, machine learning, and big data.
Data scientists use a blend of tools, algorithms, and machine learning principles to discover hidden patterns in raw data. Statistical methods and techniques play an important role in carrying out these tasks.
Let’s understand the basic statistics for data science and the different concepts used in data science.
Basic Statistics for Data Science
Statistics is the discipline of analyzing data. As such, it intersects heavily with data science and machine learning. The following are the four basic concepts of statistics used in data science.
1. Descriptive Statistics
Descriptive statistics summarizes the data at hand through numbers such as the mean, median, mode, variance, and standard deviation, so that the data is easier to understand. It does not involve any generalization or inference beyond what is available. This means that descriptive statistics simply represent the available data (the sample) and are not based on any theory of probability.
In business, it provides the analyst with a view of key metrics and measures (mentioned above) within the business. Descriptive statistics include exploratory data analysis, unsupervised learning, clustering, and basic data summaries. Descriptive statistics usually are the starting point for any analysis. Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
Descriptive statistics are very important because raw data presented as-is is hard to interpret, especially when there is a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way.
To understand the role of descriptive statistics, let’s consider the following example. You have the marks obtained by 100,000 students in a particular examination and you may be interested in the overall performance of these students. Descriptive statistics allow us to do this.
The mean of the data gives the average score of the students. The median and quartiles help locate where a particular student stands relative to the others (the student's percentile), while the standard deviation and variance show how spread out the scores are.
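As a rough illustration of the summaries described above, here is a minimal sketch using Python's standard `statistics` module. The list `marks` is made-up data for a small class, not the 100,000 students from the example, and the population versions of variance and standard deviation are used on the assumption that we are describing only the data at hand.

```python
import statistics

# Hypothetical exam marks for a small class (illustrative data only).
marks = [35, 42, 50, 50, 58, 61, 67, 72, 80, 95]

mean_mark = statistics.mean(marks)      # average score
median_mark = statistics.median(marks)  # middle score of the sorted data
mode_mark = statistics.mode(marks)      # most frequent score

# Spread of the data around the mean (population formulas, since we
# are only describing this data set, not generalizing from it).
variance = statistics.pvariance(marks)
std_dev = statistics.pstdev(marks)

# Quartile cut points: 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(marks, n=4)

print("mean:", mean_mark)
print("median:", median_mark)
print("mode:", mode_mark)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
print("quartiles:", q1, q2, q3)
```

A student whose mark is above `q3` is in roughly the top quarter of this class, which is the kind of "where does a student stand" question that quartiles answer.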
2. Inferential Statistics
In inferential statistics, we make an inference from a sample about the population. The main aim of inferential statistics is to draw conclusions from the sample and generalize them to the population. For example, suppose you want to find the average salary of a data analyst across a country. There are two options available to you:
- The first option is to consider the salary of data analysts across the country and take an average of it.
- The second option is to take a sample of the salary of data analysts from major IT cities of a country and take their average and consider that for the whole country.
The first option is rarely practical because it is very difficult to collect salary data for every data analyst in the country. It is time-consuming as well as costly. To overcome these issues, we turn to the second option: collect a small sample of data analysts' salaries and use their average as an estimate of the country's average. This is inferential statistics, where we make an inference from a sample about the population.
The most common methodologies in inferential statistics are hypothesis tests, confidence intervals, and regression analysis.
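The salary example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `population` list is synthetic data generated for the demo, the sample is a simple random sample (which is not the same as sampling only from major IT cities), and the confidence interval uses the normal approximation with the critical value 1.96.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 salaries (synthetic, not real data).
population = [random.gauss(60_000, 12_000) for _ in range(100_000)]

# Draw a simple random sample instead of surveying everyone.
sample = random.sample(population, 500)

# Point estimate of the population mean and its standard error.
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval (normal critical value 1.96).
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"sample mean: {sample_mean:.0f}")
print(f"95% CI: ({ci_low:.0f}, {ci_high:.0f})")
print(f"population mean: {statistics.mean(population):.0f}")
```

In a real study the population mean is unknown, which is the whole point: the interval expresses how far the sample average might plausibly be from it.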
3. Prediction
Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset. Prediction is the process of trying to guess an outcome given a set of realizations of the outcome and some predictors. Regression, logistic regression, boosting, random forests, and deep learning are all examples of prediction methods used in machine learning.
Predictive analytics uses historical data to predict future events. Typically, historical data is used to build a mathematical model that captures important trends. Then that predictive model is used on current data to predict what will happen next or to suggest actions to take for optimal outcomes.
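To make the "build a model on historical data, then apply it to new data" idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares. The `months` and `sales` values are invented for illustration, and a simple linear trend is an assumption of this example, not something predictive analytics requires.

```python
# Hypothetical historical data: month number and units sold (illustrative only).
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.5, 13.5, 16.0, 18.5, 19.5]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Ordinary least squares estimates for the slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
sxx = sum((x - mean_x) ** 2 for x in months)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict(month):
    """Predict sales for a given month using the fitted line."""
    return intercept + slope * month


# Apply the model to a month that is not in the historical data.
next_month_forecast = predict(7)
print(f"forecast for month 7: {next_month_forecast:.1f}")
```

Real predictive models usually involve more predictors and a held-out data set for checking accuracy, but the workflow is the same: fit on the past, predict the future.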
Predictive analytics has received a lot of attention in recent years due to advances in supporting technology, particularly in the areas of big data and machine learning.
4. Experimental Design
At the heart of every data science project lies the planning, design, and execution of experiments. Such experiments aim at understanding the data, potentially cleaning it, and performing the necessary data analysis for knowledge discovery and decision-making. Without knowing the experimental design processes that are used in practice, researchers may not be able to discover what is really hidden in their data.
Experimental design is the act of controlling your experimental process to optimize the chance of arriving at sound conclusions. The most notable example of experimental design is randomization. In randomization, a treatment is assigned at random across experimental units to make the treatment groups as comparable as possible. Clinical trials are a well-known example of studies that employ randomization.
In random sampling, one tries to randomly sample from a population of interest to get better generalizability of the results to the population.
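Randomization and random sampling can both be sketched with Python's standard `random` module. The unit IDs, group sizes, and seed below are assumptions chosen for illustration only.

```python
import random

rng = random.Random(42)  # seeded generator so the sketch is reproducible

# Hypothetical experimental units (for example, participant IDs 1 to 20).
units = list(range(1, 21))

# Randomization: shuffle the units, then split them into two equal groups,
# so that assignment to treatment or control does not depend on the unit.
shuffled = units[:]
rng.shuffle(shuffled)
treatment = shuffled[:10]
control = shuffled[10:]

# Random sampling: draw 5 units from the population without replacement.
survey_sample = rng.sample(units, 5)

print("treatment:", sorted(treatment))
print("control:", sorted(control))
print("survey sample:", sorted(survey_sample))
```

The two ideas answer different questions: randomization makes the groups within a study comparable, while random sampling makes the study's units representative of the wider population.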
- What are the four basic concepts of statistics used in Data Science?
- What is Descriptive Statistics?
- What is Inferential Statistics?
- What is the difference between Descriptive & Inferential Statistics?
- Mean, Median, Mode, Quartiles, and Standard Deviation come under:
  - Descriptive Statistics
  - Inferential Statistics
- Hypothesis Tests, Confidence Intervals, and Regression Analysis come under:
  - Descriptive Statistics
  - Inferential Statistics
- What is meant by Probability?
- What is meant by Prediction in Statistics?
Why is statistics useful for data science?
Advanced machine learning algorithms in data science utilize statistics to identify and convert data patterns into usable evidence. Data scientists use statistics to collect, evaluate, analyze, and draw conclusions from data, as well as to implement quantitative mathematical models for pertinent variables.
What type of statistics is used in data science?
Two main types of statistics are used in data science.
1. Descriptive statistics summarizes the data at hand through numbers such as the mean, median, mode, variance, and standard deviation, so that the data is easier to understand. It does not involve any generalization or inference beyond what is available.
2. Inferential statistics makes an inference from a sample about the population. Its main aim is to draw conclusions from the sample and generalize them to the population.
What topics in statistics are needed for data science?
Data analysis requires descriptive statistics and probability theory, at a minimum. These concepts will help you make better business decisions from data. Key concepts include probability distributions, statistical significance, hypothesis testing, and regression.
Data science is the study of analyzing data for actionable insights, and statistics is the discipline of analyzing data, so the two are directly related. Data science uses four basic concepts of statistics: descriptive statistics, inferential statistics, prediction, and experimental design.