Hypothesis Testing

1. What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It involves formulating a null hypothesis (H0) representing no effect or relationship and an alternative hypothesis (H1) indicating the presence of an effect or relationship. Statistical tests are then applied to determine whether the observed data provide strong enough evidence to reject the null hypothesis in favor of the alternative hypothesis. Common tests include z-tests, t-tests, chi-square tests, and ANOVA, and they provide p-values to quantify the strength of the evidence against the null hypothesis.

1.1. Assumptions

  • Random Sampling: The data must be collected randomly from the population.

  • Independence: Observations should be independent of each other.

  • Normality: The data should be approximately normally distributed, especially for small sample sizes.

  • Homogeneity of Variance (Constant variance): For certain tests like ANOVA or t-tests, the variance among the groups being compared should be approximately equal.

1.2. Building Hypothesis Tests

  • Two-tailed test:

    • Null H0: Estimate = Value

    • Alternative H1: Estimate ≠ Value

  • One-tailed test:

    • Null H0: Estimate ≥ Value (or Estimate ≤ Value)

    • Alternative H1: Estimate < Value (or Estimate > Value)

When we conduct our test, we get a test statistic, such as a z-score or t-statistic. From this statistic we calculate the p-value: the probability of obtaining results at least as extreme as our sample results, assuming the null hypothesis is true. If the p-value is small enough, so that the statistic falls in the blue rejection region (image above), we reject the null hypothesis.

We use the significance level (alpha) to set how strong the evidence must be before we reject the null hypothesis. Alpha is the probability of rejecting a true null hypothesis that we are willing to accept. A common value is 0.05, which corresponds to a 95% confidence level.
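As a minimal sketch of the decision rule described above, the snippet below turns a z statistic into a p-value and compares it with alpha. The helper names `p_value_from_z` and `reject_null` and the example statistic `z = 1.8` are hypothetical, chosen only for illustration, and the sketch assumes the test statistic follows a standard normal distribution under H0.

```python
from statistics import NormalDist


def p_value_from_z(z, tail="two-sided"):
    """Return the p-value of a z statistic under the standard normal."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":        # H1: Estimate > Value
        return 1.0 - std_normal.cdf(z)
    if tail == "left":         # H1: Estimate < Value
        return std_normal.cdf(z)
    if tail == "two-sided":    # H1: Estimate != Value
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'right', 'left', or 'two-sided'")


def reject_null(p_value, alpha=0.05):
    """Reject H0 when the p-value falls below the significance level."""
    return p_value < alpha


z = 1.8  # hypothetical test statistic, not taken from the article
for tail in ("right", "two-sided"):
    p = p_value_from_z(z, tail)
    print(f"{tail}: p = {p:.4f}, reject H0: {reject_null(p)}")
```

With this hypothetical statistic, the right-tailed test rejects H0 at alpha = 0.05 while the two-tailed test does not, which shows why the choice of tails has to be made before looking at the data.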

1.3. Which one to use? Z-test or T-test?

When we get our outcome, there will always be a probability of obtaining false results; this is what our significance level and power are for. There are two types of errors we can make. Consider a confusion matrix with our predictions (the test decisions) on the y-axis. Type I errors, or false positives (top right), occur when we incorrectly reject a true null hypothesis. Type II errors, or false negatives (bottom left), occur when we fail to reject the null hypothesis although an effect really exists; that is, we predicted no effect when there really was one.

The image below shows the relationship between the parameters we have discussed so far for one-tailed hypothesis testing:

Note that to minimize false positives (FP) and false negatives (FN) while maximizing true positives (TP), we face trade-offs similar to balancing recall and precision in a Classification problem in Machine Learning. As alpha (α) decreases, beta (β) increases, reducing statistical power. Factors influencing α and β include sample size, spread of distribution and the difference between assumption and observation.

The chance of a Type II error falls as the sample size grows. It rises as the confidence level increases (that is, as alpha decreases) and as the minimum effect size we want to detect gets smaller.
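The trade-off between alpha, beta, sample size, and effect size can be made concrete with a small power calculation. This is a sketch for a right-tailed one-sample z-test with known standard deviation, which is an assumption made here for simplicity; the function name `power_one_tailed_z` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_tailed_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a right-tailed one-sample z-test.

    effect: true difference between the population mean and the H0 value
    sigma:  known population standard deviation
    n:      sample size
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    shift = effect * sqrt(n) / sigma          # mean of the z statistic under H1
    return 1.0 - std_normal.cdf(z_crit - shift)


# Hypothetical scenarios: vary one factor at a time.
scenarios = [
    ("baseline", dict(effect=0.5, sigma=1.0, n=20, alpha=0.05)),
    ("smaller alpha", dict(effect=0.5, sigma=1.0, n=20, alpha=0.01)),
    ("larger sample", dict(effect=0.5, sigma=1.0, n=50, alpha=0.05)),
    ("smaller effect", dict(effect=0.2, sigma=1.0, n=20, alpha=0.05)),
    ("wider spread", dict(effect=0.5, sigma=2.0, n=20, alpha=0.05)),
]
for label, kwargs in scenarios:
    power = power_one_tailed_z(**kwargs)
    print(f"{label:15s} power = {power:.3f}, beta = {1.0 - power:.3f}")
```

Lowering alpha, shrinking the effect, or widening the spread each raise beta relative to the baseline, while a larger sample lowers it.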

2. Real Life Examples

2.1. A/B Testing a Landing Page

An online retailer ran an A/B test and obtained the following conversions per control and treatment groups:

The company would like to test the hypothesis that the new design indeed yields better conversions. Test this hypothesis at a 5% significance level (alpha).

Solution:

Since the company wants to test if the new design (treatment) yields more conversions than the old design (control), a one-tailed test is appropriate.

  • Null Hypothesis (H0): p_t ≤ p_c

  • Alternative Hypothesis (H1): p_t > p_c

  • Significance level: α = 0.05

The control and treatment groups consist of different users or sessions. In a typical A/B test, each user sees only one version of the page (control or treatment), and there is no pairing between individuals across the groups. Therefore, we treat the two samples as independent.

Manual Calculation

  1. Calculate Sample Proportions:

  2. Pooled Proportion:

  3. Standard Error:

  4. Test Statistic (Z-score):

  5. P-value (One-Tailed):

The p-value for Z = 2.39 (one-tailed) is the area under the standard normal curve to the right of 2.39. From a standard normal distribution table, P(Z > 2.39) ≈ 0.0084.
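For reference, the five steps above correspond to the standard pooled two-proportion z-test, written out below in general form. The symbols x_c, x_t (conversions) and n_c, n_t (visitors) for the control and treatment groups are notation assumed here; the article's own numbers are not reproduced.

```latex
\begin{align*}
\hat{p}_c &= \frac{x_c}{n_c}, \qquad \hat{p}_t = \frac{x_t}{n_t}
  && \text{(sample proportions)} \\
\hat{p} &= \frac{x_c + x_t}{n_c + n_t}
  && \text{(pooled proportion)} \\
SE &= \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_c} + \frac{1}{n_t}\right)}
  && \text{(standard error)} \\
Z &= \frac{\hat{p}_t - \hat{p}_c}{SE}
  && \text{(test statistic)} \\
p\text{-value} &= P(Z > z_{\text{obs}}) = 1 - \Phi(z_{\text{obs}})
  && \text{(one-tailed, right)}
\end{align*}
```

Here Φ is the standard normal cumulative distribution function, so z_obs = 2.39 gives 1 − Φ(2.39) ≈ 0.0084.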

Conclusion:

Since p-value ≈ 0.0084 < 0.05 (our alpha), we reject the null hypothesis and conclude that the new page (treatment) likely leads to a higher conversion rate than the old (control) page.

Python Solution with Scipy and Statsmodels
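A possible implementation of the one-tailed two-proportion z-test is sketched below using `scipy.stats.norm`. The function name `two_proportion_ztest` and the conversion counts passed to it are hypothetical placeholders, not the retailer's data, so the printed Z will differ from the 2.39 obtained above.

```python
from math import sqrt

from scipy.stats import norm


def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Right-tailed pooled z-test for H1: p_t > p_c.

    conv_c, conv_t: number of conversions in control / treatment
    n_c, n_t:       number of visitors in control / treatment
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Under H0 both groups share one conversion rate, so pool the data.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = norm.sf(z)  # survival function: P(Z > z)
    return z, p_value


# Hypothetical counts for illustration only.
z, p_value = two_proportion_ztest(conv_c=200, n_c=2000, conv_t=240, n_t=2000)
alpha = 0.05
print(f"Z = {z:.3f}, one-tailed p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Statsmodels offers the same test through `proportions_ztest` in `statsmodels.stats.proportion`; passing the treatment group first with `alternative='larger'` matches the one-tailed hypothesis used here.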

2.2. Price Check: Which one is more expensive?

A large electronics retailer works with numerous laptop suppliers, and two of the major brands are HP and Dell. The store stocks a variety of laptop models from each brand, ranging from budget-friendly options to high-end machines. The management wants to understand whether there is a consistent price difference between the HP and Dell laptops. Specifically, they ask: "On average, are the Dell laptops priced higher than the HP laptops we stock?" This is important because it might influence:

  • Inventory decisions (should you stock more of the brand that consistently costs less?),

  • Marketing and promotions (do you highlight HP for value-seeking customers and Dell for performance-seeking customers?),

  • Negotiations with suppliers (if the price difference is significant, maybe you can negotiate better deals).

Data Collection:

  • Over the past month, you randomly select 25 different HP laptop models and record their average in-store price.

  • Similarly, you randomly select 25 different Dell laptop models and record their average in-store price.

Formulating the Hypotheses:

  • Null Hypothesis (H0): μHP=μDell (The average prices are equal)

  • Alternative Hypothesis (H1): μHP≠μDell (The average prices are not equal)

  • Assumption: Both populations have equal variance, which is what allows us to pool the two sample variances.

Manual Calculation

  1. Sample Stats

    • For the HP laptop group:

      • Sample mean x_hp = $984.08

      • Sample standard deviation s_hp = $91.96

        • Sample variance s_hp^2 = 8,457

      • Sample Size n_hp = 25

    • For the Dell laptop group:

      • Sample mean x_dell = $1101.39

      • Sample standard deviation s_dell = $107.29

        • Sample variance s_dell^2 = 11,511

      • Sample Size n_dell = 25

  2. Pooled Standard Deviation:

  3. Standard Error of the Difference in Means:

  4. Test Statistic (T-score):

  5. P-value (Two-Tailed Test):

We have t = −4.151 with df = 25 + 25 − 2 = 48. For a two-tailed test, the p-value is: p = 2 × P(T > |−4.151|) = 2 × P(T > 4.151) ≈ 0.00013
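The intermediate values behind t = −4.151 can be reconstructed from the sample statistics listed in step 1, assuming the usual pooled-variance formulas for the equal-variance two-sample t-test:

```latex
\begin{align*}
s_p &= \sqrt{\frac{(n_{hp} - 1)\,s_{hp}^2 + (n_{dell} - 1)\,s_{dell}^2}{n_{hp} + n_{dell} - 2}}
     = \sqrt{\frac{24 \cdot 8457 + 24 \cdot 11511}{48}}
     = \sqrt{9984} \approx 99.92 \\
SE &= s_p \sqrt{\frac{1}{n_{hp}} + \frac{1}{n_{dell}}}
    = 99.92 \sqrt{\frac{1}{25} + \frac{1}{25}} \approx 28.26 \\
t &= \frac{\bar{x}_{hp} - \bar{x}_{dell}}{SE}
   = \frac{984.08 - 1101.39}{28.26}
   = \frac{-117.31}{28.26} \approx -4.151
\end{align*}
```

The negative sign only reflects the order of subtraction (HP minus Dell); the two-tailed p-value depends on |t|.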

Conclusion:

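The test shows a statistically significant difference (p-value ≈ 0.00013 < 0.05), so we reject the null hypothesis of equal average prices. Because the Dell sample mean is higher, the data suggest that Dell laptops are, on average, more expensive than HP laptops.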

Python Solution with Scipy and Statsmodels
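Since only summary statistics are listed for the two groups, one way to reproduce the result is `scipy.stats.ttest_ind_from_stats`, which runs the two-sample t-test from means, standard deviations, and sample sizes. The sketch below assumes equal variances, as stated in the hypotheses; the variable names are illustrative.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the manual calculation above.
hp_mean, hp_std, hp_n = 984.08, 91.96, 25
dell_mean, dell_std, dell_n = 1101.39, 107.29, 25

# equal_var=True gives the pooled-variance (Student's) t-test;
# the default alternative is two-sided, matching H1: mu_HP != mu_Dell.
result = ttest_ind_from_stats(
    mean1=hp_mean, std1=hp_std, nobs1=hp_n,
    mean2=dell_mean, std2=dell_std, nobs2=dell_n,
    equal_var=True,
)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05
print(f"t = {t_stat:.3f}, two-tailed p-value = {p_value:.5f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With the raw price lists available, `scipy.stats.ttest_ind(hp_prices, dell_prices, equal_var=True)` would be the equivalent call; setting `equal_var=False` switches to Welch's t-test when the equal-variance assumption is doubtful.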
