Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating a null hypothesis (H0) representing no effect or relationship and an alternative hypothesis (H1) indicating the presence of an effect or relationship. Statistical tests are then applied to determine whether the observed data provide enough evidence to reject the null hypothesis in favor of the alternative hypothesis. Common tests include z-tests, t-tests, chi-square tests, and ANOVA, and they provide p-values to quantify the strength of the evidence against the null hypothesis.
Random Sampling: The data must be collected randomly from the population.
Independence: Observations should be independent of each other.
Normality: The data should be approximately normally distributed, especially for small sample sizes.
Homogeneity of Variance (Constant variance): For certain tests like ANOVA or t-tests, the variance among the groups being compared should be approximately equal.
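Null hypothesis (H0), two-tailed test: Estimate = Value
Null hypothesis (H0), one-tailed test: Estimate ≥ Value (or Estimate ≤ Value)
Alternative hypothesis (H1), two-tailed test: Estimate ≠ Value
Alternative hypothesis (H1), one-tailed test: Estimate < Value (or Estimate > Value)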
When we conduct our test, we get a test statistic, such as a z-score or t-statistic. From this statistic we can calculate the p-value, which is the probability of obtaining results at least as extreme as our sample results, assuming the null hypothesis is true. If the p-value is small enough, so that the test statistic falls in the rejection region (shaded blue in the image above), we reject the null hypothesis.
We use the significance level (alpha) to decide how small the p-value must be before we reject the null hypothesis, that is, how strong the evidence needs to be. Alpha is the probability of rejecting the null hypothesis when it is actually true. A common alpha value is 0.05, which corresponds to a 95% confidence level.
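The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `p_value_from_z` and the test statistic `z = 1.80` are assumptions made for the example, not values from this page.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str = "right") -> float:
    """Return the p-value of a z statistic under the standard normal distribution."""
    cdf = NormalDist().cdf
    if tail == "right":      # H1: Estimate > Value
        return 1.0 - cdf(z)
    if tail == "left":       # H1: Estimate < Value
        return cdf(z)
    if tail == "two-sided":  # H1: Estimate != Value
        return 2.0 * (1.0 - cdf(abs(z)))
    raise ValueError(f"unknown tail: {tail!r}")


alpha = 0.05
z = 1.80  # hypothetical test statistic, for illustration only

for tail in ("right", "two-sided"):
    p = p_value_from_z(z, tail)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{tail}: p = {p:.4f} -> {decision}")
```

With the same statistic, the one-tailed test rejects at alpha = 0.05 while the two-tailed test does not, which is why the direction of the alternative hypothesis has to be fixed before looking at the data.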
Whatever the outcome, there is always some probability of reaching a wrong conclusion; the significance level and power quantify that risk. There are two types of errors that we can get. Let's look at a confusion matrix for more on this, with our predictions on the y-axis. Type I errors, or false positives, shown in the top right, occur when you incorrectly reject a true null hypothesis. Type II errors, or false negatives, shown in the bottom left, occur when you fail to reject the null hypothesis even though an effect really exists. In other words, we predicted no effect when there really was one.
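To make the Type I error rate concrete, the following sketch simulates many experiments in which the null hypothesis is true and counts how often a right-tailed z-test rejects it anyway. The setup (standard normal data, known standard deviation of 1, 30 observations, 2,000 trials, and the function name `simulate_type_i_rate`) is assumed purely for illustration.

```python
import random
from statistics import NormalDist, mean


def simulate_type_i_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a right-tailed z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # critical value for the chosen alpha
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the data really have mean 0 (and known sd 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / n ** 0.5)
        if z > z_crit:
            rejections += 1  # a false positive (Type I error)
    return rejections / trials


rate = simulate_type_i_rate()
print(f"empirical Type I error rate: {rate:.3f}")
```

The empirical rate should land close to the chosen alpha, since alpha is by construction the probability of rejecting a true null hypothesis.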
The image below shows the relationship between the parameters discussed so far for one-tailed hypothesis testing:
Note that to minimize false positives (FP) and false negatives (FN) while maximizing true positives (TP), we face trade-offs similar to balancing recall and precision in a classification problem in machine learning. With everything else held fixed, as alpha (α) decreases, beta (β) increases, reducing statistical power. Factors influencing α and β include the sample size, the spread of the distribution, and the difference between the assumed and the observed value.
A larger sample size lowers the chance of a Type II error, whereas a higher confidence level (a smaller α) raises it. The smaller the minimum effect size we want to detect, the higher the chance of a Type II error.
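These trade-offs can be checked numerically. The sketch below computes the power (1 − β) of a right-tailed one-sample z-test with known standard deviation, a simplified setting chosen here for illustration; the function name and all parameter values are hypothetical.

```python
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test when the true mean shift is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma         # how far H1 moves the z statistic
    return 1.0 - std_normal.cdf(z_crit - shift)


# Hypothetical settings, for illustration only.
base = power_one_sided_z(effect=0.5, sigma=2.0, n=50, alpha=0.05)
larger_sample = power_one_sided_z(effect=0.5, sigma=2.0, n=200, alpha=0.05)
stricter_alpha = power_one_sided_z(effect=0.5, sigma=2.0, n=50, alpha=0.01)
smaller_effect = power_one_sided_z(effect=0.2, sigma=2.0, n=50, alpha=0.05)

print(f"baseline power:       {base:.3f}")
print(f"larger sample (n=200): {larger_sample:.3f}")
print(f"stricter alpha (0.01): {stricter_alpha:.3f}")
print(f"smaller effect (0.2):  {smaller_effect:.3f}")
```

Power rises with the sample size and falls when alpha is tightened or when the effect to be detected shrinks; β is simply one minus each of these values.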
An online retailer ran an A/B test and obtained the following conversions for the control and treatment groups:
The company would like to test the hypothesis that the new design indeed yields better conversions. Test this hypothesis at the 5% significance level (alpha).
Since the company wants to test if the new design (treatment) yields more conversions than the old design (control), a one-tailed test is appropriate.
Null Hypothesis (H0): p_t ≤ p_c
Alternative Hypothesis (H1): p_t > p_c
Significance level α=0.05
The control and treatment groups typically represent different sets of users or sessions. In typical A/B tests, each user sees only one version of the page (control or treatment), so the samples are independent. There is no pairing or linking between individuals in the control and treatment groups. Therefore, we assume the groups are independent.
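Calculate Sample Proportions: p̂_c = x_c / n_c and p̂_t = x_t / n_t, where x is the number of conversions and n is the number of users in each group.
Pooled Proportion: p̂ = (x_c + x_t) / (n_c + n_t)
Standard Error: SE = √( p̂ (1 − p̂) (1/n_c + 1/n_t) )
Test Statistic (Z-score): Z = (p̂_t − p̂_c) / SE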
P-value (One-Tailed):
The p-value for Z = 2.39 (one-tailed) is the area under the standard normal curve to the right of 2.39. From the standard normal distribution table, P(Z > 2.39) ≈ 0.0084.
Conclusion:
Since p-value ≈ 0.0084 < 0.05 (our alpha), we reject the null hypothesis and conclude that the new page (treatment) likely leads to a higher conversion rate than the old (control) page.
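The steps of this one-tailed two-proportion z-test can be sketched in Python as follows. The conversion counts used here are hypothetical placeholders, not the retailer's figures, so the resulting Z and p-value differ from the ones quoted above; the function and argument names are likewise assumptions for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Right-tailed z-test of H1: p_t > p_c using the pooled proportion."""
    p_c = conv_c / n_c                         # control conversion rate
    p_t = conv_t / n_t                         # treatment conversion rate
    p_pool = (conv_c + conv_t) / (n_c + n_t)   # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 1.0 - NormalDist().cdf(z)        # area to the right of z
    return z, p_value


# Hypothetical counts, for illustration only.
alpha = 0.05
z, p_value = two_proportion_z_test(conv_c=200, n_c=2000, conv_t=250, n_t=2000)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"Z = {z:.3f}, p-value = {p_value:.4f} -> {decision}")
```

Pooling the two samples to estimate the standard error is consistent with the null hypothesis, under which both groups share the same conversion rate.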
A large electronics retailer works with numerous laptop suppliers, and two of the major brands are HP and Dell. The store stocks a variety of laptop models from each brand, ranging from budget-friendly options to high-end machines. The management wants to understand whether there’s a consistent price difference between the HP and Dell laptops. Specifically, they ask: "On average, are the Dell laptops priced higher than the HP laptops we stock?" This is important because it might influence:
Inventory decisions (should you stock more of the brand that consistently costs less?),
Marketing and promotions (do you highlight HP for value-seeking customers and Dell for performance-seeking customers?),
Negotiations with suppliers (if the price difference is significant, maybe you can negotiate better deals).
Data Collection:
Over the past month, you randomly select 25 different HP laptop models and record their average in-store price.
Similarly, you randomly select 25 different Dell laptop models and record their average in-store price.
Formulating the Hypotheses:
Null Hypothesis (H0): μ_HP = μ_Dell (The average prices are equal)
Alternative Hypothesis (H1): μ_HP ≠ μ_Dell (The average prices are not equal)
Assumption: Both populations have equal variance, which is what allows us to combine the two sample variances into a pooled variance.
Sample Stats
For the HP laptop group:
Sample mean x_hp = $984.08
Sample standard deviation s_hp = $91.96
Sample variance s_hp^2 = 8,457
Sample Size n_hp = 25
For the Dell laptop group:
Sample mean x_dell = $1101.39
Sample standard deviation s_dell = $107.29
Sample variance s_dell^2 = 11,511
Sample Size n_dell = 25
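Pooled Standard Deviation: s_p = √( ((n_hp − 1) s_hp^2 + (n_dell − 1) s_dell^2) / (n_hp + n_dell − 2) ) = √( (24 × 8,457 + 24 × 11,511) / 48 ) = √9,984 ≈ 99.92
Standard Error of the Difference in Means: SE = s_p × √(1/n_hp + 1/n_dell) ≈ 99.92 × √0.08 ≈ 28.26
Test statistic (T-score): t = (x_hp − x_dell) / SE = (984.08 − 1101.39) / 28.26 ≈ −4.151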
P-value (Two-Tailed Test):
We have t = −4.151 with df = 25 + 25 − 2 = 48. For a two-tailed test, the p-value is: p = 2 × P(T > |−4.151|) = 2 × P(T > 4.151) ≈ 0.00013
Conclusion:
The test shows a statistically significant difference (p-value < 0.05), and suggests that Dell laptops are, on average, more expensive than HP laptops.
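The pooled two-sample t-test above can be reproduced from the summary statistics alone. The sketch below uses only the Python standard library, so the two-tailed p-value is obtained by numerically integrating the Student's t density with Simpson's rule; that integration approach and the helper names are choices made for this example, and a statistics library would normally supply the t-distribution directly.

```python
from math import gamma, pi, sqrt


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution (fine for moderate df; gamma overflows for very large df)."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; `steps` must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    return 2 * (0.5 - area)       # both tails beyond |t|


# Summary statistics from the HP and Dell samples above.
t, df = pooled_t_statistic(984.08, 91.96, 25, 1101.39, 107.29, 25)
p = two_tailed_p(t, df)
print(f"t = {t:.3f}, df = {df}, two-tailed p = {p:.5f}")
```

If the equal-variance assumption were in doubt, Welch's t-test, which does not pool the variances, would be the usual alternative.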