Hypothesis Testing in ML – Explained to Kids

Machine learning practitioners come from a tradition of algorithms, with a pragmatic focus on results and model skill above other concerns such as model interpretability.

Statisticians work on much the same type of modeling problems under the names of applied statistics and statistical learning. Coming from a mathematical background, they have more of a focus on the behavior of models and the explainability of predictions.

The very close relationship between the two approaches to the same problem means that both fields have a lot to learn from each other. ML models use many statistical concepts, and one of them is hypothesis testing.

Hypothesis Testing in ML

Machine learning models are chosen based on their mean performance, often calculated using k-fold cross-validation.

The algorithm with the best mean performance is expected to be better than those algorithms with worse mean performance. But what if the difference in the mean performance is caused by a statistical fluke?

The solution is to use a statistical hypothesis test to evaluate whether the difference in the mean performance between any two algorithms is real or not.

Model selection involves evaluating a suite of different machine learning algorithms or modeling pipelines and comparing them based on their performance.

The model or modeling pipeline that achieves the best performance according to your performance metric is then selected as the final model that you can then use to start making predictions on new data.

This applies to regression and classification predictive modeling tasks with classical machine learning algorithms and deep learning. It’s always the same process.

The problem is, how do you know the difference between two models is real and not just a statistical fluke?

This problem can be addressed using a statistical hypothesis test.
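
As a rough illustration of the idea, the sketch below compares two models using their per-fold scores from the same k-fold cross-validation split. The fold scores are made-up numbers, and the function name `paired_permutation_test` is a hypothetical helper, not something from a library. It uses an exact sign-flip permutation test, which is only one of several tests that could be used here. Keep in mind that fold scores share training data, so they are not fully independent and the p-value should be read as approximate.

```python
from itertools import product
from statistics import mean


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired fold scores.

    Under H0 (no real difference between the models), each per-fold
    difference is equally likely to be positive or negative. We try every
    sign assignment and count how often the mean difference is at least
    as extreme as the one actually observed.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        permuted = abs(mean(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so floating-point noise does not hide exact ties.
        if permuted >= observed - 1e-12:
            extreme += 1
        total += 1
    return mean(diffs), extreme / total


# Hypothetical accuracy scores from 10-fold cross-validation (same folds).
model_a_scores = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.81, 0.80]
model_b_scores = [0.78, 0.77, 0.80, 0.79, 0.80, 0.79, 0.77, 0.81, 0.79, 0.78]

mean_diff, p_value = paired_permutation_test(model_a_scores, model_b_scores)
print(f"mean difference = {mean_diff:.3f}, p-value = {p_value:.4f}")
```

A small p-value here suggests that the gap in mean performance is unlikely to be a fluke of how the folds happened to be drawn.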

What is Hypothesis Testing?

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.

First, a tentative assumption is made about the parameter or distribution. This assumption is called the Null Hypothesis and is denoted by H0. An Alternative Hypothesis (denoted Ha), which is the opposite of what is stated in the null hypothesis, is then defined.

The hypothesis-testing procedure involves using sample data to determine whether or not H0 can be rejected. If H0 is rejected, the statistical conclusion is that the data support the alternative hypothesis Ha.

For example, assume that a radio station selects the music it plays based on the assumption that the average age of its listening audience is 30 years. To determine whether this assumption is valid, a hypothesis test could be conducted with the null hypothesis given as H0: μ = 30 and the alternative hypothesis given as Ha: μ ≠ 30.

Based on a sample of individuals from the listening audience, the sample mean age, x̄, can be computed and used to determine whether there is sufficient statistical evidence to reject H0. Conceptually, a value of the sample mean that is “close” to 30 is consistent with the null hypothesis, while a value of the sample mean that is “not close” to 30 provides support for the alternative hypothesis. What is considered “close” and “not close” is determined by using the sampling distribution of x̄.
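
The radio station example can be sketched in a few lines of Python. The listener ages below are invented for illustration, and `one_sample_z_test` is a hypothetical helper name. The sketch assumes the sample is large enough that the sampling distribution of the mean is roughly normal; for small samples a t-distribution would usually be the better choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_z_test(sample, mu0):
    """Two-sided test of H0: population mean == mu0 (normal approximation)."""
    n = len(sample)
    x_bar = mean(sample)
    # Standard error: how much the sample mean varies from sample to sample.
    standard_error = stdev(sample) / sqrt(n)
    # How many standard errors the sample mean is away from mu0.
    z = (x_bar - mu0) / standard_error
    # Probability of a z at least this far from 0, in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return x_bar, z, p_value


# Hypothetical ages of 30 sampled listeners (assumed data, not from a survey).
ages = [24, 31, 35, 28, 42, 37, 29, 33, 45, 26, 38, 30, 41, 27, 36,
        34, 39, 32, 25, 44, 31, 29, 40, 35, 28, 37, 33, 43, 30, 36]

x_bar, z, p_value = one_sample_z_test(ages, mu0=30)
print(f"sample mean = {x_bar:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Here "close" and "not close" become concrete: the z value measures the distance between the sample mean and 30 in units of standard error.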

Process of Conducting Hypothesis Testing

When you are evaluating a hypothesis, you need to account for both the variability in your sample and how large your sample is.  Based on this information, you’d like to make an assessment of whether any differences you see are meaningful, or if they are likely just due to chance.  This is formally done through a process called hypothesis testing.

Five steps in Hypothesis Testing are:

  • Specify the Null Hypothesis
  • Specify the Alternative Hypothesis
  • Set the Significance Level
  • Calculate the Test Statistic and Corresponding p-Value
  • Drawing a Conclusion

Step 1: Specify the Null Hypothesis

The null hypothesis (H0) is a statement of no effect, relationship, or difference between two or more groups or factors.  In research studies, a researcher is usually interested in disproving the null hypothesis.

Examples:

  • There is no difference in intubation rates across ages 0 to 5 years.
  • The intervention and control groups have the same survival rate (or, the intervention does not improve survival rate).
  • There is no association between injury type and whether or not the patient received an IV in the prehospital setting.

Step 2: Specify the Alternative Hypothesis

The alternative hypothesis (Ha) is the statement that there is an effect or difference.  This is usually the hypothesis the researcher is interested in proving.  The alternative hypothesis can be one-sided (only provides one direction, e.g., lower) or two-sided.  We often use two-sided tests even when our true hypothesis is one-sided because it requires more evidence against the null hypothesis to accept the alternative hypothesis.

Examples:

  • The intubation success rate differs with the age of the patient being treated (two-sided).
  • The time to resuscitation from cardiac arrest is lower for the intervention group than for the control (one-sided).
  • There is an association between injury type and whether or not the patient received an IV in the prehospital setting (two-sided).

Step 3: Set the Significance Level (α)

The significance level (denoted by the Greek letter alpha, α) is generally set at 0.05.  This means that there is a 5% chance that you will reject your null hypothesis and accept your alternative hypothesis when your null hypothesis is actually true. The smaller the significance level, the greater the burden of proof needed to reject the null hypothesis, or in other words, to support the alternative hypothesis.
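
One way to see what the 5% means is to simulate many experiments in which the null hypothesis really is true and count how often a test rejects it anyway. The sketch below is a toy simulation under assumed conditions (normally distributed data with a known standard deviation of 1); `false_positive_rate` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Fraction of tests that reject H0 even though H0 is actually true."""
    rng = random.Random(seed)
    # Cut-off for a two-sided z-test at significance level alpha.
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The true mean really is 0, so H0: mean == 0 holds in every trial.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / sqrt(n))
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials


rate = false_positive_rate(alpha=0.05)
print(f"rejected a true null hypothesis in {rate:.1%} of trials")
```

The printed rate should land near α, and lowering `alpha` makes these false alarms rarer.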

Step 4: Calculate the Test Statistic and Corresponding p-Value

Hypothesis testing generally uses a test statistic that compares groups or examines associations between variables.  When describing a single sample without establishing relationships between variables, a confidence interval is commonly used.

The p-value is the probability of obtaining a sample statistic as extreme as, or more extreme than, the one you observed by chance alone if your null hypothesis is true.  This p-value is determined based on the result of your test statistic.  Your conclusions about the hypothesis are based on your p-value and your significance level.

Examples:

  • p-value = 0.01: a result at least this extreme will happen about 1 in 100 times by pure chance if your null hypothesis is true. It is not likely to happen strictly by chance.
  • p-value = 0.75: a result at least this extreme will happen about 75 in 100 times by pure chance if your null hypothesis is true. It is very likely to occur strictly by chance.

If you do a large number of tests to evaluate a hypothesis (called multiple testing), then you need to control for this in your designation of the significance level or calculation of the p-value.  For example, if three outcomes measure the effectiveness of a drug or other intervention, you will have to adjust for these three analyses.
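
A simple and commonly used adjustment is the Bonferroni correction, which multiplies each p-value by the number of tests. The sketch below applies it to three made-up p-values standing in for the three outcomes; it is only one of several possible corrections, and `bonferroni` is a hypothetical helper name.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(p_values)
    # Adjusted p-values are capped at 1 because they are probabilities.
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values for three outcomes of the same intervention.
raw_p_values = [0.010, 0.030, 0.200]

adjusted, reject = bonferroni(raw_p_values)
for raw, adj, rej in zip(raw_p_values, adjusted, reject):
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}, reject H0: {rej}")
```

Notice that the second outcome (raw p = 0.030) would pass on its own at α = 0.05 but no longer does after the adjustment.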

Step 5: Drawing a Conclusion

  1. p-value <= significance level (α) => Reject your null hypothesis in favor of your alternative hypothesis.  Your result is statistically significant.
  2. p-value > significance level (α) => Fail to reject your null hypothesis.  Your result is not statistically significant.

Hypothesis testing is not set up so that you can absolutely prove a null hypothesis.  Therefore, when you do not find evidence against the null hypothesis, you fail to reject the null hypothesis. When you do find strong enough evidence against the null hypothesis, you reject the null hypothesis.  Your conclusions also translate into a statement about your alternative hypothesis.  When presenting the results of a hypothesis test, include the descriptive statistics in your conclusions as well.  Report exact p-values rather than a range.  For example, “The intubation rate differed significantly by patient age, with younger patients having a lower rate of successful intubation (p=0.02).”  Here are two more examples with the conclusion stated in several different ways.

Example:

  • H0: There is no difference in survival between the intervention and control groups.
  • Ha: There is a difference in survival between the intervention and control groups.
  • α = 0.05; 20% increase in survival for the intervention group; p-value = 0.002

Conclusion:

  • Reject the null hypothesis in favor of the alternative hypothesis.
  • The difference in survival between the intervention and control groups was statistically significant.
  • There was a 20% increase in survival for the intervention group compared to the control (p=0.002).

Example:

  • H0: There is no difference in survival between the intervention and control groups.
  • Ha: There is a difference in survival between the intervention and control groups.
  • α = 0.05; 5% increase in survival for the intervention group; p-value = 0.20.

Conclusion:

  • Fail to reject the null hypothesis.
  • The difference in survival between the intervention and control groups was not statistically significant.
  • There was no significant increase in survival for the intervention group compared to the control (p=0.20).
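
The decision rule from Step 5 is simple enough to write as a small function. The sketch below applies it to the p-values given in the two examples above (0.002 and 0.20) with α = 0.05; `draw_conclusion` is a hypothetical helper name used only for illustration.

```python
def draw_conclusion(p_value, alpha=0.05):
    """Apply the Step 5 decision rule to a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "Reject H0: the result is statistically significant."
    # We never "accept" H0; we only fail to find evidence against it.
    return "Fail to reject H0: the result is not statistically significant."


for p in (0.002, 0.20):
    print(f"p = {p}: {draw_conclusion(p)}")
```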
