Sample Size Calculation

How do we decide how long a test should run, or, in our terms, how many observations we need per group? This question matters because it is standard advice to fix the sample size before starting an experiment. While many A/B testing guides attempt to give general rules, the answer really varies case by case. A common approach to this problem is referred to as power analysis.

Power Analysis

We perform a power analysis to determine the required sample size; it involves the following inputs:

  1. Effect size (calculated via lift): the minimum size of the effect that we want to detect in a test; for example, a 5% increase in conversion rates.

    1. For testing differences in means, after selecting a suitable minimum detectable effect (MDE) of interest, we convert it into a standardized effect size known as Cohen's d, defined as the difference between the two means divided by the pooled standard deviation (see the code sketch after this list):

      Cohen's d = (µ_B - µ_A) / stdev_pooled

    2. For differences in proportions, a common effect size to use is Cohen's h, calculated using the formula:

      Cohen's h = 2 arcsin(sqrt(p1)) - 2 arcsin(sqrt(p2))

    A general rule of thumb:

    • 0.2 corresponds to a small effect,

    • 0.5 is a medium effect,

    • 0.8 is large.

  2. Significance level (predetermined): the alpha value, i.e., the probability of falsely detecting an effect that isn't there; 5% is typical.

  3. Power (predetermined): the probability of detecting an effect that is truly present, equal to 1 − β; 80% is a common choice.
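As a concrete sketch of the two effect-size formulas in code (the group means, pooled standard deviation, and conversion rates below are made-up illustrative values; statsmodels' proportion_effectsize implements Cohen's h):

```python
import numpy as np
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's d for a difference in means (illustrative numbers).
mu_a, mu_b = 10.0, 10.5      # hypothetical group means
stdev_pooled = 2.0           # hypothetical pooled standard deviation
cohens_d = (mu_b - mu_a) / stdev_pooled

# Cohen's h for a difference in proportions, e.g. a lift from a
# 10% to a 12% conversion rate, via the arcsine transform.
p1, p2 = 0.12, 0.10
cohens_h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# statsmodels implements the same formula.
assert np.isclose(cohens_h, proportion_effectsize(p1, p2))

print(f"Cohen's d: {cohens_d:.3f}")   # 0.250 -> roughly a small effect
print(f"Cohen's h: {cohens_h:.3f}")
```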

Keep in mind that changing any of these inputs changes the required sample size: more power, a smaller significance level, or a smaller minimum detectable effect all lead to a larger sample size.
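Putting the three inputs together, the sketch below solves for the required sample size with statsmodels (the effect size, alpha, and power values are illustrative choices, not recommendations):

```python
from statsmodels.stats.power import TTestIndPower, NormalIndPower

# Observations needed per group to detect a "small" standardized
# effect (d = 0.2) at alpha = 0.05 with 80% power, using a
# two-sided, two-sample t-test.
n_means = TTestIndPower().solve_power(
    effect_size=0.2, alpha=0.05, power=0.8,
    ratio=1.0, alternative="two-sided",
)

# The analogous calculation for proportions, treating Cohen's h = 0.2
# as the effect size (based on the normal approximation).
n_props = NormalIndPower().solve_power(
    effect_size=0.2, alpha=0.05, power=0.8, ratio=1.0,
)

print(f"Per-group n (means):       {n_means:.0f}")
print(f"Per-group n (proportions): {n_props:.0f}")
```

With these inputs, both calculations land at roughly 390–400 observations per group; halving the effect size roughly quadruples that number, since the required n scales with 1/d².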

Effect Size, Sample Size, and Power

Below is a power-of-test graph for varying sample sizes and effect sizes:

A sketch of code that can produce such a plot, using statsmodels' built-in plot_power (the sample-size range, effect sizes, and figure size are illustrative choices):
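```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

# Power curves for a two-sample t-test: one curve per standardized
# effect size, plotted against the per-group sample size.
fig, ax = plt.subplots(figsize=(8, 5))
TTestIndPower().plot_power(
    dep_var="nobs",
    nobs=np.arange(5, 500),
    effect_size=np.array([0.2, 0.5, 0.8]),
    alpha=0.05,
    ax=ax,
    title="Power of Test",
)
plt.show()
```

Each curve rises toward 1 as the sample grows; the smaller the effect, the more observations are needed to reach 80% power.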
