Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating a null hypothesis (H0), representing no effect or relationship, and an alternative hypothesis (H1), indicating the presence of an effect or relationship. A statistical test is then applied to determine whether the observed data is significant enough to reject the null hypothesis in favor of the alternative. Common tests include z-tests, t-tests, chi-square tests, and ANOVA; each produces a p-value that quantifies the strength of the evidence against the null hypothesis.
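To make the workflow concrete before the examples below, here is a minimal sketch of a hypothesis test in Python: a one-sample t-test against a hypothesized population mean. The data is simulated and purely illustrative.

```python
# Minimal sketch of a hypothesis test: one-sample t-test.
# H0: the population mean equals 50; H1: it does not.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=30)  # hypothetical measurements

t_stat, p_value = ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% level")
else:
    print("Fail to reject H0 at the 5% level")
```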
1.1. Assumptions
Random Sampling: The data must be collected randomly from the population.
Independence: Observations should be independent of each other.
Normality: The data should be approximately normally distributed, especially for small sample sizes.
Homogeneity of Variance: For certain tests like ANOVA or t-tests, the variance among the groups being compared should be approximately equal.
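The normality and equal-variance assumptions can be checked with formal tests before running a t-test or ANOVA. A sketch, using SciPy's Shapiro-Wilk and Levene tests on hypothetical data:

```python
# Sketch: checking the normality and equal-variance assumptions
# on two hypothetical samples before running a t-test.
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=30)  # hypothetical data
group_b = rng.normal(loc=105, scale=10, size=30)

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
stat_a, p_a = shapiro(group_a)
stat_b, p_b = shapiro(group_b)
print(f"Shapiro-Wilk p-values: A={p_a:.3f}, B={p_b:.3f}")

# Levene: H0 = the groups have equal variances
stat_l, p_l = levene(group_a, group_b)
print(f"Levene p-value: {p_l:.3f}")
```

A large p-value here means the data shows no evidence against the assumption, not proof that the assumption holds.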
1.2. Building Hypothesis Tests
Hypothesis        | Two-tailed test   | One-tailed test
Null (H0)         | Estimate = Value  | Estimate ≤ Value (or Estimate ≥ Value)
Alternative (H1)  | Estimate ≠ Value  | Estimate > Value (or Estimate < Value)
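In SciPy and statsmodels, the two-tailed/one-tailed choice maps directly to the `alternative` argument. A sketch using hypothetical conversion counts:

```python
# How the one-/two-tailed choice maps to the `alternative` argument.
# The counts below are purely illustrative.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([628, 574])    # hypothetical successes per group
nobs = np.array([4902, 5098])   # hypothetical group sizes

results = {}
for alt in ("two-sided", "larger", "smaller"):
    stat, pval = proportions_ztest(count, nobs, alternative=alt)
    results[alt] = pval
    print(f"{alt:>9}: z = {stat:.4f}, p = {pval:.4f}")
```

The z-statistic is the same in all three cases; only the p-value changes with the direction of the alternative.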
1.3. Which one to use? Z-test or t-test?
As a rule of thumb, use a z-test when the population standard deviation is known or the sample size is large (roughly n ≥ 30); use a t-test when the population standard deviation must be estimated from the sample, especially for small samples.
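The practical difference is visible in the critical values: the t-distribution has heavier tails than the standard normal for small degrees of freedom and converges to it as the sample grows, which is why the two tests give nearly identical answers for large n.

```python
# Compare two-tailed 5% critical values of the z- and t-tests.
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # critical value under the standard normal
print(f"z critical value: {z_crit:.3f}")  # 1.960
for df in (5, 30, 1000):
    print(f"t critical value (df={df}): {t.ppf(0.975, df):.3f}")
```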
2. Real Life Examples
2.1. A/B Testing a Landing Page
An online retailer ran an A/B test and obtained the following conversions for the control and treatment groups:
conv_control = 574    # conversions in the control group
total_control = 5098  # total number of users in the control group
conv_treat = 628      # conversions in the treatment group
total_treat = 4902    # total number of users in the treatment group
The company would like to test the hypothesis that the new design indeed yields better conversions. Test this hypothesis with a 5% significance level (alpha).
Solution:
Since the company wants to test if the new design (treatment) yields more conversions than the old design (control), a one-tailed test is appropriate.
Null Hypothesis (H0): p_t ≤ p_c
Alternative Hypothesis (H1): p_t > p_c
Significance level α=0.05
The control and treatment groups represent different sets of users or sessions: in a typical A/B test each user sees only one version of the page (control or treatment), and there is no pairing or linking between individuals in the two groups. We therefore treat the samples as independent.
Standard Error (pooled):
p_pool = (conv_control + conv_treat) / (n_c + n_t) = (574 + 628) / 10000 = 0.1202
SE = sqrt( p_pool × (1 − p_pool) × (1/n_c + 1/n_t) ), where n_c = 5098 and n_t = 4902
SE = sqrt(0.1057 × 0.000400) ≈ 0.0065
Test Statistic (Z-score):
Z = (p_t − p_c) / SE = (0.1281 − 0.1126) / 0.0065 ≈ 2.39
where p_t = 628/4902 ≈ 0.1281 and p_c = 574/5098 ≈ 0.1126
P-value (One-Tailed):
The p-value for Z = 2.39 (one-tailed) is the area under the standard normal curve to the right of 2.39. From the standard normal distribution table, P(Z > 2.39) ≈ 0.0084.
Conclusion:
Since p-value ≈ 0.0084 < 0.05 (our alpha), we reject the null hypothesis and conclude that the new page (treatment) likely leads to a higher conversion rate than the old (control) page.
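Alongside the test, a confidence interval for the difference in conversion rates conveys the size of the effect. A sketch of a 95% Wald interval (note it uses the unpooled standard error, unlike the pooled SE in the test statistic):

```python
# Sketch: 95% Wald confidence interval for the difference in
# conversion rates (treatment minus control).
import numpy as np
from scipy.stats import norm

conv_control, total_control = 574, 5098
conv_treat, total_treat = 628, 4902

p_c = conv_control / total_control
p_t = conv_treat / total_treat
diff = p_t - p_c

# Unpooled standard error (the usual choice for a CI)
se = np.sqrt(p_c * (1 - p_c) / total_control + p_t * (1 - p_t) / total_treat)
z = norm.ppf(0.975)
lo, hi = diff - z * se, diff + z * se
print(f"Difference: {diff:.4f}, 95% CI: ({lo:.4f}, {hi:.4f})")
```

An interval that excludes zero is consistent with rejecting the null hypothesis at the corresponding significance level.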
Python Solution with Scipy and Statsmodels
# A webpage tests the hypothesis that the new design of the landing page
# yields more conversions. Test this hypothesis with 5% significance
# level (alpha).
# H0: Treatment ≤ Control
# H1: Treatment > Control
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

conv_control = 574    # conversions in the control group
total_control = 5098  # total number of users in the control group
conv_treat = 628      # conversions in the treatment group
total_treat = 4902    # total number of users in the treatment group

# Method 1 - using scipy
p_c = conv_control / total_control
p_t = conv_treat / total_treat
p_pool = (conv_control + conv_treat) / (total_control + total_treat)
se = np.sqrt(p_pool * (1 - p_pool) * (1/total_control + 1/total_treat))
Z = (p_t - p_c) / se
# One-tailed p-value
p_value_scipy = 1 - norm.cdf(Z)
print(f"Z-stat (scipy): {Z:.4f}")
print(f"p-value (scipy): {p_value_scipy:.4f}")

# Method 2 - using statsmodels
count = np.array([conv_treat, conv_control])
nobs = np.array([total_treat, total_control])
stat, pval = proportions_ztest(count, nobs, alternative="larger")
print(f"Z-stat (statsmodels): {stat:.4f}")
print(f"p-value (statsmodels): {pval:.4f}")

"""
Z-stat (scipy): 2.3855
p-value (scipy): 0.0085
Z-stat (statsmodels): 2.3855
p-value (statsmodels): 0.0085

Conclusion:
p-value of .009 < alpha value of .05. Therefore, we reject the null
hypothesis and conclude that the change has a positive effect on the
conversion rate at the 5% significance level (or even the 1% level).
"""
2.2. Price Check: Which one is more expensive?
A large electronics retailer works with numerous laptop suppliers, and two of the major brands are HP and Dell. The store stocks a variety of laptop models from each brand, ranging from budget-friendly options to high-end machines. The management wants to understand whether there is a consistent price difference between the HP and Dell laptops. Specifically, they ask: "On average, are the Dell laptops priced higher than the HP laptops we stock?" This is important because it might influence:
Inventory decisions (should you stock more of the brand that consistently costs less?),
Marketing and promotions (do you highlight HP for value-seeking customers and Dell for performance-seeking customers?),
Negotiations with suppliers (if the price difference is significant, maybe you can negotiate better deals).
Data Collection:
Over the past month, you randomly select 25 different HP laptop models and record their average in-store price.
Similarly, you randomly select 25 different Dell laptop models and record their average in-store price.
Formulating the Hypotheses:
Null Hypothesis (H0): μHP=μDell (The average prices are equal)
Alternative Hypothesis (H1): μHP≠μDell (The average prices are not equal)
Assumption: Both populations have equal variance (which affects how we calculate the pooled variance).
From the samples: x̄_HP = 984.08, s_HP = 91.96; x̄_Dell = 1101.39, s_Dell = 107.29; n_HP = n_Dell = 25.
s_pooled = sqrt( ((n_HP − 1)s_HP² + (n_Dell − 1)s_Dell²) / (n_HP + n_Dell − 2) ) ≈ 99.92
SE = s_pooled × sqrt(1/n_HP + 1/n_Dell) = 99.92 × sqrt(1/25 + 1/25) ≈ 28.26
Test statistic (T-score):
t = (x̄_HP − x̄_Dell) / SE = (984.08 − 1101.39) / 28.26 ≈ −4.151
P-value (Two-Tailed Test):
We have t = −4.151 with df = 25 + 25 - 2 = 48. For a two-tailed test, the p-value is: p = 2 × P(T > |4.151|) ≈ 0.00013
Conclusion:
The test shows a statistically significant difference (p-value ≈ 0.00013 < 0.05), suggesting that Dell laptops are, on average, more expensive than HP laptops in this sample.
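If the equal-variance assumption is doubtful (here the two sample standard deviations, 91.96 and 107.29, do differ somewhat), Welch's t-test drops that assumption. A sketch using the same simulated prices as the code below:

```python
# Sketch: Welch's t-test (equal_var=False) drops the equal-variance
# assumption of the classic pooled two-sample t-test.
import numpy as np
from scipy.stats import ttest_ind

np.random.seed(21)  # same simulated prices as in the section's example
hp_prices = np.random.normal(loc=1005, scale=95, size=25)
dell_prices = np.random.normal(loc=1080, scale=115, size=25)

t_welch, p_welch = ttest_ind(hp_prices, dell_prices, equal_var=False)
print(f"Welch t-statistic: {t_welch:.3f}")
print(f"Welch p-value: {p_welch:.5f}")
```

When the sample sizes and variances are similar, Welch's result is close to the pooled test; it mainly matters when group variances or sizes are unbalanced.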
Python Solution with Scipy and Statsmodels
# Both packages require the list of prices as opposed to having only
# sample means. Therefore we need to create the datasets first.
import numpy as np
from scipy.stats import t, ttest_ind
from statsmodels.stats.weightstats import ttest_ind as ttest_ind_sm

# Sample data (same as in the manual example, but here we simulate it):
# Suppose these are the collected prices for HP and Dell laptops.
np.random.seed(21)  # For reproducibility

# HP laptop prices (25 data points)
hp_prices = np.random.normal(loc=1005, scale=95, size=25)
# Dell laptop prices (25 data points)
dell_prices = np.random.normal(loc=1080, scale=115, size=25)

# Print basic info
n_hp = len(hp_prices)
n_dell = len(dell_prices)
mean_hp = np.mean(hp_prices)
mean_dell = np.mean(dell_prices)
std_hp = np.std(hp_prices, ddof=1)
std_dell = np.std(dell_prices, ddof=1)
print("Data Summary:")
print(f"HP: mean={mean_hp:.2f}, std={std_hp:.2f}, n={n_hp}")
print(f"Dell: mean={mean_dell:.2f}, std={std_dell:.2f}, n={n_dell}\n")

# Assume equal variances (classic two-sample t-test assumption)
# Pooled variance and standard deviation
sp_squared = ((n_hp - 1)*std_hp**2 + (n_dell - 1)*std_dell**2) / (n_hp + n_dell - 2)
sp = np.sqrt(sp_squared)

# Pooled standard error
se = sp * np.sqrt((1/n_hp) + (1/n_dell))

# T statistic
t_stat = (mean_hp - mean_dell) / se

# Degrees of freedom
df = n_hp + n_dell - 2

# Two-tailed p-value from t-distribution: p-value = 2 * P(T > |t_stat|)
p_value_manual = 2 * (1 - t.cdf(abs(t_stat), df))
print(f"T-statistic (manual): {t_stat:.3f}")
print(f"Degrees of freedom: {df}")
print(f"P-value (manual, two-tailed): {p_value_manual:.5f}\n")

# SciPy two-sample t-test (assumes equal variances by default)
t_stat_scipy, p_value_scipy = ttest_ind(hp_prices, dell_prices, equal_var=True)
print("SciPy Results:")
print(f"T-statistic (SciPy): {t_stat_scipy:.3f}")
print(f"P-value (SciPy): {p_value_scipy:.5f}\n")

# Statsmodels two-sample t-test
# usevar='pooled' assumes equal variances, like SciPy's default
t_stat_sm, p_value_sm, df_sm = ttest_ind_sm(hp_prices, dell_prices, usevar='pooled', alternative='two-sided')
print("Statsmodels Results:")
print(f"T-statistic (Statsmodels): {t_stat_sm:.3f}")
print(f"P-value (Statsmodels): {p_value_sm:.5f}")

'''
Data Summary:
HP: mean=984.08, std=91.96, n=25
Dell: mean=1101.39, std=107.29, n=25

T-statistic (manual): -4.151
Degrees of freedom: 48
P-value (manual, two-tailed): 0.00013

SciPy Results:
T-statistic (SciPy): -4.151
P-value (SciPy): 0.00013

Statsmodels Results:
T-statistic (Statsmodels): -4.151
P-value (Statsmodels): 0.00013
'''