Hypothesis Testing

1. What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating a null hypothesis (H0) representing no effect or relationship and an alternative hypothesis (H1) indicating the presence of an effect or relationship. A statistical test is then applied to determine whether the observed data provides enough evidence to reject the null hypothesis in favor of the alternative hypothesis. Common tests include z-tests, t-tests, chi-square tests, and ANOVA; each produces a p-value that quantifies the strength of the evidence against the null hypothesis.

1.1. Assumptions

Random Sampling: The data must be collected randomly from the population.
Independence: Observations should be independent of each other.
Normality: The data should be approximately normally distributed, especially for small sample sizes.
Homogeneity of Variance (Constant variance): For certain tests like ANOVA or t-tests, the variance among the groups being compared should be approximately equal.
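
The normality and equal-variance assumptions can be checked before choosing a test. The sketch below is a minimal illustration, not a full diagnostic workflow: the two samples are hypothetical and generated only for the example, and it uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples, generated only for illustration
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=52, scale=5, size=40)

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Levene: H0 = the groups have equal variances
levene = stats.levene(group_a, group_b)

print(f"Shapiro-Wilk p-value (group A): {shapiro_a.pvalue:.3f}")
print(f"Shapiro-Wilk p-value (group B): {shapiro_b.pvalue:.3f}")
print(f"Levene p-value: {levene.pvalue:.3f}")
```

A small p-value in either check is evidence against the corresponding assumption; a large p-value does not prove the assumption holds, it only fails to contradict it.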

1.2. Building Hypothesis Tests

Hypothesis        | Two-tailed test   | One-tailed test
Null H0           | Estimate = Value  | Estimate ≥ Value (or Estimate ≤ Value)
Alternative H1    | Estimate ≠ Value  | Estimate < Value (or Estimate > Value)

When we conduct our test, we get a test statistic, such as a z-score or t-statistic. From this statistic we calculate the p-value: the probability of obtaining results at least as extreme as our sample results, assuming the null hypothesis is true. If the p-value is small enough, meaning the test statistic falls in the rejection region (shaded blue in the image above), we reject the null hypothesis.

We use the significance level (alpha) to set how small the p-value must be before we reject the null hypothesis. Alpha is the probability of rejecting a true null hypothesis that we are willing to tolerate. A common alpha value is 0.05, which corresponds to a 95% confidence level.
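
The decision rule can be sketched in a few lines. This is an illustrative example only: the helper `z_p_value` and the z statistic of 1.80 are hypothetical, and the labels `"two-sided"`, `"larger"` and `"smaller"` are chosen here to name the alternative hypothesis.

```python
from scipy.stats import norm

def z_p_value(z, alternative="two-sided"):
    """p-value for a z statistic; `alternative` names H1 (hypothetical helper)."""
    if alternative == "two-sided":
        return 2 * norm.sf(abs(z))   # area in both tails
    if alternative == "larger":
        return norm.sf(z)            # area to the right of z
    if alternative == "smaller":
        return norm.cdf(z)           # area to the left of z
    raise ValueError(f"unknown alternative: {alternative}")

alpha = 0.05
z = 1.80  # hypothetical test statistic

for alt in ("two-sided", "larger", "smaller"):
    p = z_p_value(z, alt)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{alt:>9}: p-value = {p:.4f} -> {decision}")
```

The same statistic can lead to different decisions depending on the alternative: here the right-tailed test rejects H0 at the 5% level while the two-tailed test does not, which is why the hypotheses must be fixed before looking at the data.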

1.3. Which one to use? Z-test or T-test?

Whatever the outcome of the test, there is always some probability of reaching a false conclusion; this is what the significance level and power control. There are two types of errors we can make. Let's look at a confusion matrix for more on this, with our predictions on the y-axis. Type I errors, or false positives, shown in the top right, occur when we incorrectly reject a true null hypothesis. Type II errors, or false negatives, shown in the bottom left, occur when we fail to reject the null hypothesis even though an effect really exists. In other words, we predicted no effect when there really was one.
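
Both error types can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings (normal data, 30 observations per group, a true shift of 0.5 standard deviations for the "effect" case); none of these numbers come from the examples in this article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 5000   # number of simulated experiments
n = 30                 # observations per group

# Case 1: H0 is true (both groups come from the same distribution).
a = rng.normal(0.0, 1.0, size=(n_experiments, n))
b = rng.normal(0.0, 1.0, size=(n_experiments, n))
p_null = stats.ttest_ind(a, b, axis=1).pvalue
type_i_rate = np.mean(p_null < alpha)      # rejected although H0 is true

# Case 2: H0 is false (group c is shifted by 0.5 standard deviations).
c = rng.normal(0.5, 1.0, size=(n_experiments, n))
p_effect = stats.ttest_ind(a, c, axis=1).pvalue
type_ii_rate = np.mean(p_effect >= alpha)  # failed to reject although an effect exists

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The estimated Type I error rate should land close to alpha, while the Type II error rate depends on the effect size and sample size chosen for the simulation.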

The image below shows the relationship between the parameters discussed so far for a one-tailed hypothesis test:

Note that to minimize false positives (FP) and false negatives (FN) while maximizing true positives (TP), we face trade-offs similar to balancing recall and precision in a classification problem in machine learning. With everything else held fixed, as alpha (α) decreases, beta (β) increases, which reduces statistical power (1 − β). Factors influencing α and β include the sample size, the spread of the distribution, and the difference between the assumed and observed values.

A larger sample size lowers the chance of a Type II error. A higher confidence level (smaller α) raises it, and so does a smaller minimum effect size, because small effects are harder to detect.
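
These trade-offs can be checked numerically. The sketch below assumes the simplest setting, a right-tailed one-sample z-test with known standard deviation, where power = P(Z > z_(1−α) − effect·sqrt(n)/σ). The helper `power_one_tailed_z` and all input values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

def power_one_tailed_z(effect, sigma, n, alpha):
    """Power of a right-tailed one-sample z-test with known sigma.

    effect: true mean minus the mean assumed under H0 (hypothetical input).
    """
    z_crit = norm.ppf(1 - alpha)          # rejection threshold under H0
    shift = effect * np.sqrt(n) / sigma   # how far H1 moves the test statistic
    return norm.sf(z_crit - shift)        # P(reject H0 | H1 is true) = 1 - beta

baseline = power_one_tailed_z(effect=2.0, sigma=10.0, n=100, alpha=0.05)
stricter_alpha = power_one_tailed_z(effect=2.0, sigma=10.0, n=100, alpha=0.01)
larger_sample = power_one_tailed_z(effect=2.0, sigma=10.0, n=400, alpha=0.05)
smaller_effect = power_one_tailed_z(effect=1.0, sigma=10.0, n=100, alpha=0.05)

for label, power in [
    ("baseline", baseline),
    ("smaller alpha", stricter_alpha),
    ("larger sample", larger_sample),
    ("smaller effect", smaller_effect),
]:
    print(f"{label:>14}: power = {power:.3f}, beta = {1 - power:.3f}")
```

Under these assumptions, tightening alpha or shrinking the effect size raises beta, while a larger sample lowers it.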

2. Real Life Examples

2.1. A/B Testing a Landing Page

An online retailer ran an A/B test and obtained the following conversions per control and treatment groups:

conv_control = 574     # conversions in the control group
total_control = 5098   # total number of users in the control group
conv_treat = 628       # conversions in the treatment group
total_treat = 4902     # total number of users in the treatment group

The company would like to test the hypothesis that the new design yields a higher conversion rate. Test this hypothesis at the 5% significance level (alpha).

Solution:

Since the company wants to test if the new design (treatment) yields more conversions than the old design (control), a one-tailed test is appropriate.

Null Hypothesis (H0): p_t ≤ p_c
Alternative Hypothesis (H1): p_t > p_c
Significance level: α = 0.05

The control and treatment groups typically represent different sets of users or sessions. In typical A/B tests, each user sees only one version of the page (control or treatment), so the samples are independent. There is no pairing or linking between individuals in the control and treatment groups. Therefore, we assume the groups are independent.

Manual Calculation

Calculate Sample Proportions:

p_c = 574 / 5098 ≈ 0.112593
p_t = 628 / 4902 ≈ 0.128111

Pooled Proportion:

p_pool = (574+628) / (5098+4902) = 1202 / 10000 = 0.1202

Standard Error:

SE = sqrt( p_pool × (1 − p_pool) × (1/n_c + 1/n_t) ), where n_c = 5098 and n_t = 4902
SE = sqrt(0.1058 × 0.000400) ≈ 0.0065

Test Statistic (Z-score):

Z = (p_t − p_c) / SE = (0.128111 − 0.112593) / 0.0065 ≈ 2.39

P-value (One-Tailed):

The p-value for Z = 2.39 (one-tailed) is the area under the standard normal curve to the right of 2.39. From the standard normal distribution table, P(Z > 2.39) ≈ 0.0084.

Conclusion:

Since p-value ≈ 0.0084 < 0.05 (our alpha), we reject the null hypothesis and conclude that the new page (treatment) likely leads to a higher conversion rate than the old (control) page.

Python Solution with Scipy and Statsmodels


# An online retailer tests the hypothesis that the new landing page design
# yields more conversions. Test this hypothesis at the 5% significance
# level (alpha).
# H0: Treatment ≤ Control
# H1: Treatment > Control
conv_control = 574     # conversions in the control group
total_control = 5098   # total number of users in the control group
conv_treat = 628       # conversions in the treatment group
total_treat = 4902     # total number of users in the treatment group


import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest


# Method 1 - using scipy
p_c = conv_control / total_control
p_t = conv_treat / total_treat

p_pool = (conv_control + conv_treat) / (total_control + total_treat)
se = np.sqrt(p_pool * (1 - p_pool) * (1/total_control + 1/total_treat))

Z = (p_t - p_c) / se

# One-tailed p-value
p_value_scipy = 1 - norm.cdf(Z)

print(f"Z-stat (scipy): {Z:.4f}")
print(f"p-value (scipy): {p_value_scipy:.4f}")


# Method 2 - using statsmodels
count = np.array([conv_treat, conv_control])
nobs = np.array([total_treat, total_control])

stat, pval = proportions_ztest(count, nobs, alternative="larger")

print(f"Z-stat (statsmodels): {stat:.4f}")
print(f"p-value (statsmodels): {pval:.4f}")


"""
Z-stat (scipy): 2.3855
p-value (scipy): 0.0085
Z-stat (statsmodels): 2.3855
p-value (statsmodels): 0.0085



Conclusion:
The p-value of 0.0085 is below the alpha value of 0.05. Therefore, we reject the
null hypothesis and conclude that the change has a positive effect on the conversion
rate at the 5% significance level (and even at the 1% significance level).
"""

2.2. Price Check: Which one is more expensive?

A large electronics retailer works with numerous laptop suppliers, and two of the major brands are HP and Dell. The store stocks a variety of laptop models from each brand, ranging from budget-friendly options to high-end machines. The management wants to understand whether there is a consistent price difference between the HP and Dell laptops. Specifically, they ask: "On average, are the Dell laptops priced higher than the HP laptops we stock?" This is important because it might influence:

Inventory decisions (should you stock more of the brand that consistently costs less?),
Marketing and promotions (do you highlight HP for value-seeking customers and Dell for performance-seeking customers?),
Negotiations with suppliers (if the price difference is significant, maybe you can negotiate better deals).

Data Collection:

Over the past month, you randomly select 25 different HP laptop models and record their average in-store price.
Similarly, you randomly select 25 different Dell laptop models and record their average in-store price.

Formulating the Hypotheses:

Null Hypothesis (H0): μ_HP = μ_Dell (The average prices are equal)
Alternative Hypothesis (H1): μ_HP ≠ μ_Dell (The average prices are not equal)
Assumption: Both populations have equal variance. This is what allows us to use a pooled variance in the calculation below.

Manual Calculation

Sample Stats
- For the HP laptop group:
  - Sample mean x_hp = $984.08
  - Sample standard deviation s_hp = $91.96
  - Sample variance s_hp^2 ≈ 8,457
  - Sample size n_hp = 25
- For the Dell laptop group:
  - Sample mean x_dell = $1101.39
  - Sample standard deviation s_dell = $107.29
  - Sample variance s_dell^2 ≈ 11,511
  - Sample size n_dell = 25

Pooled Standard Deviation:

s_pooled = sqrt ( [(25−1)⋅8457+(25−1)⋅11511] / (25+25-2))
s_pooled ≈ 99.92

Standard Error of the Difference in Means

SE = s_pooled × sqrt(1/n_hp + 1/n_dell) = 99.92 × sqrt(1/25 + 1/25)
SE ≈ 28.26

Test statistic (T-score):

t = (x_hp - x_dell) / SE = (984.08−1101.39) / 28.26 ≈ −4.151

P-value (Two-Tailed Test):

We have t = −4.151 with df = 25 + 25 − 2 = 48. For a two-tailed test, the p-value is: p = 2 × P(T > |t|) = 2 × P(T > 4.151) ≈ 0.00013

Conclusion:

The test shows a statistically significant difference (p-value < 0.05), and suggests that Dell laptops are, on average, more expensive than HP laptops.
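
As a quick cross-check of the manual calculation, SciPy can also run the test directly from summary statistics with `scipy.stats.ttest_ind_from_stats`. The sketch below assumes the rounded sample means and standard deviations listed above, and that the standard deviations are sample (ddof=1) values. The Welch variant (`equal_var=False`) is included only to show how the equal-variance assumption could be dropped; it is not part of the analysis in this section.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the manual calculation (rounded values)
pooled = ttest_ind_from_stats(
    mean1=984.08, std1=91.96, nobs1=25,     # HP
    mean2=1101.39, std2=107.29, nobs2=25,   # Dell
    equal_var=True,                         # pooled-variance t-test, df = 48
)

# Welch's t-test: same inputs, without assuming equal variances
welch = ttest_ind_from_stats(
    mean1=984.08, std1=91.96, nobs1=25,
    mean2=1101.39, std2=107.29, nobs2=25,
    equal_var=False,
)

print(f"Pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.5f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.5f}")
```

Because both groups have the same sample size, the pooled and Welch t statistics coincide; only the degrees of freedom, and therefore the p-value, differ slightly.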

Python Solution with Scipy and Statsmodels

# ttest_ind in both packages expects the raw price lists rather than only summary statistics.
# Therefore we need to create the datasets first.

import numpy as np
from scipy.stats import t, ttest_ind  # t is the t-distribution used for the manual p-value
from statsmodels.stats.weightstats import ttest_ind as ttest_ind_sm

# Sample data (same as in the manual example, but here we simulate it):
# Suppose these are the collected prices for HP and Dell laptops.
np.random.seed(21)  # For reproducibility

# HP laptop prices (25 data points)
hp_prices = np.random.normal(loc=1005, scale=95, size=25)

# Dell laptop prices (25 data points)
dell_prices = np.random.normal(loc=1080, scale=115, size=25)

# Print basic info
n_hp, n_dell = len(hp_prices), len(dell_prices)
mean_hp = np.mean(hp_prices)
mean_dell = np.mean(dell_prices)
std_hp = np.std(hp_prices, ddof=1)      # ddof=1 gives the sample standard deviation
std_dell = np.std(dell_prices, ddof=1)

print("Data Summary:")
print(f"HP: mean={mean_hp:.2f}, std={std_hp:.2f}, n={n_hp}")
print(f"Dell: mean={mean_dell:.2f}, std={std_dell:.2f}, n={n_dell}\n")



# Assume equal variances (classic two-sample t-test assumption)
# Pooled variance and standard deviation
sp_squared = ((n_hp - 1)*std_hp**2 + (n_dell - 1)*std_dell**2) / (n_hp + n_dell - 2)
sp = np.sqrt(sp_squared)

# Standard error of the difference in means (from the pooled standard deviation)
se = sp * np.sqrt((1/n_hp) + (1/n_dell))

# T statistic
t_stat = (mean_hp - mean_dell) / se

# Degrees of freedom
df = n_hp + n_dell - 2

# Two-tailed p-value from t-distribution
# p-value = 2 * P(T > |t_stat|)
p_value_manual = 2 * (1 - t.cdf(abs(t_stat), df))

print(f"T-statistic (manual): {t_stat:.3f}")
print(f"Degrees of freedom: {df}")
print(f"P-value (manual, two-tailed): {p_value_manual:.5f}")
print()

# SciPy Two-Sample T-Test (assuming equal variance by default)
t_stat_scipy, p_value_scipy = ttest_ind(hp_prices, dell_prices, equal_var=True)

print("SciPy Results:")
print(f"T-statistic (SciPy): {t_stat_scipy:.3f}")
print(f"P-value (SciPy): {p_value_scipy:.5f}\n")


# Statsmodels Two-Sample T-Test
# usevar='pooled' assumes equal variances, similar to ttest_ind default in SciPy
t_stat_sm, p_value_sm, df_sm = ttest_ind_sm(hp_prices, dell_prices, usevar='pooled', alternative='two-sided')
print("Statsmodels Results:")
print(f"T-statistic (Statsmodels): {t_stat_sm:.3f}")
print(f"P-value (Statsmodels): {p_value_sm:.5f}")


'''
Data Summary:
HP: mean=984.08, std=91.96, n=25
Dell: mean=1101.39, std=107.29, n=25

T-statistic (manual): -4.151
Degrees of freedom: 48
P-value (manual, two-tailed): 0.00013

SciPy Results:
T-statistic (SciPy): -4.151
P-value (SciPy): 0.00013

Statsmodels Results:
T-statistic (Statsmodels): -4.151
P-value (Statsmodels): 0.00013
'''
