Hypothesis Testing

Last updated 4 months ago


1. What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating a null hypothesis (H0), representing no effect or relationship, and an alternative hypothesis (H1), indicating the presence of an effect or relationship. Statistical tests are then applied to determine whether the observed data is significant enough to reject the null hypothesis in favor of the alternative hypothesis. Common tests include z-tests, t-tests, chi-square tests, and ANOVA; they provide p-values that quantify the strength of the evidence against the null hypothesis.

1.1. Assumptions

  • Random Sampling: The data must be collected randomly from the population.

  • Independence: Observations should be independent of each other.

  • Normality: The data should be approximately normally distributed, especially for small sample sizes.

  • Homogeneity of Variance (Constant variance): For certain tests like ANOVA or t-tests, the variance among the groups being compared should be approximately equal.
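These assumptions can be spot-checked in code before running a test. The sketch below is illustrative only (the two groups are simulated data, not from this document); it applies SciPy's Shapiro-Wilk test for normality and Levene's test for homogeneity of variance:

```python
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=2, size=40)  # simulated sample A
group_b = rng.normal(loc=11, scale=2, size=40)  # simulated sample B

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
_, p_norm_a = shapiro(group_a)
_, p_norm_b = shapiro(group_b)

# Levene: H0 = the groups have equal variances
_, p_equal_var = levene(group_a, group_b)

print(f"normality p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"equal-variance p-value: {p_equal_var:.3f}")
# Large p-values mean no evidence against the assumption.
```

A large p-value here does not prove the assumption holds; it only means the data shows no strong evidence against it.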

1.2. Building Hypothesis Tests

| Hypothesis       | Two-tailed test  | One-tailed test                        |
| ---------------- | ---------------- | -------------------------------------- |
| Null (H0)        | Estimate = Value | Estimate ≥ Value (or Estimate ≤ Value) |
| Alternative (H1) | Estimate ≠ Value | Estimate < Value (or Estimate > Value) |

When we conduct our test, we will get a test statistic, such as a z-score or t-statistic. With this statistic, we can calculate the p-value, which indicates the probability of obtaining our sample results, assuming the null hypothesis is true. If the p-value is small enough to fall in the rejection region (the shaded tail of the sampling distribution), we reject the null hypothesis.

We use the significance level (alpha) to determine how large of an effect we need to reject the null hypothesis, or how certain we need to be. A common alpha value is 0.05, which represents 95% confidence in our test.
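As a quick illustration (an addition, not part of the original text), the critical z-values corresponding to alpha = 0.05 can be computed with SciPy's inverse normal CDF:

```python
from scipy.stats import norm

alpha = 0.05

# One-tailed test: all of alpha sits in one tail
z_crit_one_tailed = norm.ppf(1 - alpha)      # ≈ 1.645

# Two-tailed test: alpha is split between the two tails
z_crit_two_tailed = norm.ppf(1 - alpha / 2)  # ≈ 1.960

print(f"one-tailed critical z: {z_crit_one_tailed:.3f}")
print(f"two-tailed critical z: {z_crit_two_tailed:.3f}")
```

A test statistic beyond the critical value is equivalent to a p-value below alpha.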

1.3. Which one to use? Z-test or T-test?

In short: use a z-test when the population variance is known (or the sample is large, commonly n ≥ 30), and a t-test when the population variance is unknown and the sample is small.

Whichever test we use, there will always be a probability of obtaining false results; this is what our significance level and power are for. There are two types of errors that we can get. Let's look at a confusion matrix for more on this, with our predictions on the y-axis. Type I errors, or false positives, shown in the top right, occur when you incorrectly reject a true null hypothesis. Type II errors, or false negatives, shown in the bottom left, occur when you fail to reject the null hypothesis when an effect really exists. This means that we predicted no effect when there really was an effect.

For one-tailed hypothesis testing, the parameters discussed so far are related as follows: sample size and effect size are negatively correlated with Type II error, while a higher confidence level (a smaller alpha) increases the chance of a Type II error.
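This trade-off can be made concrete with statsmodels' power calculations. The numbers below are assumed for illustration (a standardized effect size of 0.5 and alpha = 0.05); the sketch shows how power (1 − beta) grows with sample size:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

effect_size = 0.5  # Cohen's d, assumed for illustration
alpha = 0.05

for n in (10, 25, 50, 100):
    # Power of a two-sample t-test with n observations per group
    power = analysis.power(effect_size=effect_size, nobs1=n,
                           alpha=alpha, ratio=1.0)
    print(f"n per group = {n:>3} -> power = {power:.2f}")
# Larger samples shrink beta (Type II error), i.e. raise power.
```

The same `analysis.solve_power(...)` call can be inverted to find the sample size needed for a target power, which is how sample size calculators for A/B tests work.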

2. Real Life Examples

2.1. A/B Testing a Landing Page

An online retailer ran an A/B test and obtained the following conversions per control and treatment groups:

conv_control = 574     # conversions in the control group
total_control = 5098   # total number of users in the control group
conv_treat = 628       # conversions in the treatment group
total_treat = 4902     # total number of users in the treatment group

The company would like to test the hypothesis that the new design indeed yields better conversions. Test this hypothesis at the 5% significance level (alpha).

Solution:

Since the company wants to test if the new design (treatment) yields more conversions than the old design (control), a one-tailed test is appropriate.

  • Null Hypothesis (H0): p_t ≤ p_c

  • Alternative Hypothesis (H1): p_t > p_c

  • Significance level: α = 0.05

The control and treatment groups typically represent different sets of users or sessions. In typical A/B tests, each user sees only one version of the page (control or treatment), so the samples are independent. There is no pairing or linking between individuals in the control and treatment groups. Therefore, we assume the groups are independent.

Manual Calculation

  1. Calculate Sample Proportions:

p_c = 574 / 5098 ≈ 0.1126
p_t = 628 / 4902 ≈ 0.1281

  2. Pooled Proportion:

p_pool = (574 + 628) / (5098 + 4902) = 1202 / 10000 = 0.1202

  3. Standard Error:

SE = sqrt( p_pool × (1 − p_pool) × (1/n_c + 1/n_t) ), where n_c = 5098 and n_t = 4902
SE = sqrt(0.1057 × 0.000400) ≈ 0.0065

  4. Test Statistic (Z-score):

Z = (p_t − p_c) / SE = (0.1281 − 0.1126) / 0.0065 ≈ 2.39

  5. P-value (One-Tailed):

p = P(Z > 2.39) = 1 − Φ(2.39) ≈ 0.0084
Conclusion:

Since p-value ≈ 0.0084 < 0.05 (our alpha), we reject the null hypothesis and conclude that the new page (treatment) likely leads to a higher conversion rate than the old (control) page.

Python Solution with Scipy and Statsmodels

# A webpage tests the hypothesis that the new design of the landing page
# yields more conversions. Test this hypothesis with 5% significance
# level (alpha).
# H0: Treatment ≤ Control
# H1: Treatment > Control

import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

conv_control = 574     # conversions in the control group
total_control = 5098   # total number of users in the control group
conv_treat = 628       # conversions in the treatment group
total_treat = 4902     # total number of users in the treatment group


# Method 1 - using scipy
p_c = conv_control / total_control
p_t = conv_treat / total_treat

p_pool = (conv_control + conv_treat) / (total_control + total_treat)
se = np.sqrt(p_pool * (1 - p_pool) * (1/total_control + 1/total_treat))

Z = (p_t - p_c) / se

# One-tailed p-value
p_value_scipy = 1 - norm.cdf(Z)

print(f"Z-stat (scipy): {Z:.4f}")
print(f"p-value (scipy): {p_value_scipy:.4f}")


# Method 2 - using statsmodels
count = np.array([conv_treat, conv_control])
nobs = np.array([total_treat, total_control])

stat, pval = proportions_ztest(count, nobs, alternative="larger")

print(f"Z-stat (statsmodels): {stat:.4f}")
print(f"p-value (statsmodels): {pval:.4f}")


"""
Z-stat (scipy): 2.3855
p-value (scipy): 0.0085
Z-stat (statsmodels): 2.3855
p-value (statsmodels): 0.0085



Conculusion:
p-value of .0085 < alpha value of .05. Therefore, we reject the null hypothesis 
and we can conclude that the change actually has positive effect on conversion
rate at the 5% significance level (or even 1% significance level).
"""

2.2. Price Check: Which one is more expensive?

A large electronics retailer works with numerous laptop suppliers, two of the major brands being HP and Dell. The store stocks a variety of laptop models from each brand, ranging from budget-friendly options to high-end machines. The management wants to understand whether there's a consistent price difference between the HP and Dell laptops. Specifically, they ask: "On average, are the Dell laptops priced higher than the HP laptops we stock?" This is important because it might influence:

  • Inventory decisions (should you stock more of the brand that consistently costs less?),

  • Marketing and promotions (do you highlight HP for value-seeking customers and Dell for performance-seeking customers?),

  • Negotiations with suppliers (if the price difference is significant, maybe you can negotiate better deals).

Data Collection:

  • Over the past month, you randomly select 25 different HP laptop models and record their average in-store price.

  • Similarly, you randomly select 25 different Dell laptop models and record their average in-store price.

Formulating the Hypotheses:

  • Null Hypothesis (H0): μ_HP = μ_Dell (the average prices are equal)

  • Alternative Hypothesis (H1): μ_HP ≠ μ_Dell (the average prices are not equal)

  • Assumption: Both populations have equal variance (this affects how we calculate the pooled variance).

Manual Calculation

  1. Sample Stats

    • For the HP laptop group:

      • Sample mean x_hp = $984.08

      • Sample standard deviation s_hp = $91.96

        • Sample variance s_hp^2 = 8,457

      • Sample Size n_hp = 25

    • For the Dell laptop group:

      • Sample mean x_dell = $1101.39

      • Sample standard deviation s_dell = $107.29

        • Sample variance s_dell^2 = 11,511

      • Sample Size n_dell = 25

  2. Pooled Standard Deviation:

s_pooled = sqrt( [(25−1)×8457 + (25−1)×11511] / (25+25−2) )
s_pooled ≈ 99.92

  3. Standard Error of the Difference in Means:

SE = s_pooled × sqrt(1/n_hp + 1/n_dell) = 99.92 × sqrt(1/25 + 1/25)
SE ≈ 28.26

  4. Test Statistic (T-score):

t = (x_hp − x_dell) / SE = (984.08 − 1101.39) / 28.26 ≈ −4.151

  5. P-value (Two-Tailed Test):

We have t = −4.151 with df = 25 + 25 − 2 = 48. For a two-tailed test, the p-value is: p = 2 × P(T > |−4.151|) ≈ 0.00013

Conclusion:

The test shows a statistically significant difference (p-value < 0.05), suggesting that Dell laptops are, on average, more expensive than HP laptops.

Python Solution with Scipy and Statsmodels

# Both packages require the full list of prices rather than only the sample
# means, so we need to create the datasets first.

import numpy as np
from scipy.stats import ttest_ind, t
from statsmodels.stats.weightstats import ttest_ind as ttest_ind_sm

# Sample data (same as in the manual example, but here we simulate it):
# Suppose these are the collected prices for HP and Dell laptops.
np.random.seed(21)  # For reproducibility

# HP laptop prices (25 data points)
hp_prices = np.random.normal(loc=1005, scale=95, size=25)

# Dell laptop prices (25 data points)
dell_prices = np.random.normal(loc=1080, scale=115, size=25)

# Print basic info
mean_hp = np.mean(hp_prices)
mean_dell = np.mean(dell_prices)
std_hp = np.std(hp_prices, ddof=1)
std_dell = np.std(dell_prices, ddof=1)

print("Data Summary:")
print(f"HP: mean={mean_hp:.2f}, std={std_hp:.2f}, n={len(hp_prices)}")
print(f"Dell: mean={mean_dell:.2f}, std={std_dell:.2f}, n={len(dell_prices)}\n")



# Assume equal variances (classic two-sample t-test assumption)
n_hp = len(hp_prices)
n_dell = len(dell_prices)

# Pooled variance and standard deviation
sp_squared = ((n_hp - 1)*std_hp**2 + (n_dell - 1)*std_dell**2) / (n_hp + n_dell - 2)
sp = np.sqrt(sp_squared)
#print(sp)

# pooled standard error
se = sp * np.sqrt((1/n_hp) + (1/n_dell))
#print(se)

# T statistic
t_stat = (mean_hp - mean_dell) / se
#print(t_stat)

# Degrees of freedom
df = n_hp + n_dell - 2

# Two-tailed p-value from t-distribution
# p-value = 2 * P(T > |t_stat|)
p_value_manual = 2 * (1 - t.cdf(abs(t_stat), df))

print(f"T-statistic (manual): {t_stat:.3f}")
print(f"Degrees of freedom: {df}")
print(f"P-value (manual, two-tailed): {p_value_manual:.5f}")
print()

# SciPy Two-Sample T-Test (assuming equal variance by default)
t_stat_scipy, p_value_scipy = ttest_ind(hp_prices, dell_prices, equal_var=True)

print("SciPy Results:")
print(f"T-statistic (SciPy): {t_stat_scipy:.3f}")
print(f"P-value (SciPy): {p_value_scipy:.5f}\n")


# Statsmodels Two-Sample T-Test
# usevar='pooled' assumes equal variances, similar to ttest_ind default in SciPy
t_stat_sm, p_value_sm, df_sm = ttest_ind_sm(hp_prices, dell_prices, usevar='pooled', alternative='two-sided')
print("Statsmodels Results:")
print(f"T-statistic (Statsmodels): {t_stat_sm:.3f}")
print(f"P-value (Statsmodels): {p_value_sm:.5f}")


'''
Data Summary:
HP: mean=984.08, std=91.96, n=25
Dell: mean=1101.39, std=107.29, n=25

T-statistic (manual): -4.151
Degrees of freedom: 48
P-value (manual, two-tailed): 0.00013

SciPy Results:
T-statistic (SciPy): -4.151
P-value (SciPy): 0.00013

Statsmodels Results:
T-statistic (Statsmodels): -4.151
P-value (Statsmodels): 0.00013
'''
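If the equal-variance assumption is doubtful (e.g. Levene's test rejects it), Welch's t-test, which does not pool the variances, is a common alternative; SciPy exposes it through the same `ttest_ind` function via `equal_var=False`. A minimal sketch on simulated prices (the distribution parameters are assumptions, as above):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(21)
hp_prices = rng.normal(loc=1005, scale=95, size=25)     # simulated HP prices
dell_prices = rng.normal(loc=1080, scale=115, size=25)  # simulated Dell prices

# equal_var=False switches ttest_ind to Welch's t-test,
# which uses separate variances and adjusted degrees of freedom
t_welch, p_welch = ttest_ind(hp_prices, dell_prices, equal_var=False)
print(f"Welch t = {t_welch:.3f}, p = {p_welch:.5f}")
```

With equal group sizes and similar variances, Welch's result is close to the pooled test; it diverges when the groups have very different spreads or sizes.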

Note that to minimize false positives (FP) and false negatives (FN) while maximizing true positives (TP), we face trade-offs similar to balancing recall and precision in a classification problem in machine learning. As alpha (α) decreases, beta (β) increases, reducing statistical power. Factors influencing α and β include sample size, the spread of the distribution, and the difference between assumption and observation.

The p-value for Z = 2.39 (one-tailed) is the area under the standard normal curve to the right of 2.39. From a standard normal distribution table, P(Z > 2.39) ≈ 0.0084 (one-tailed).
