Sampling

Nearly all statistical findings are based on measurements taken from a sample, not the entire population. Major decisions often rely on information gathered from these samples. Take Nielsen ratings, for example—they collect data from a small sample of homes to predict television-viewing patterns for the entire country. Picking the right sample is a crucial step to ensure accurate statistical conclusions.

Reasons to Sample:

  • it is often not feasible to measure an entire population

  • even when it is feasible, measuring the entire population:

    • costs money and/or time

    • provides minimal added benefit compared to measuring a representative sample

Research uses many sampling methods, and an exact count is hard to give because existing techniques have many variations. However, they fall into two main categories, each with established sub-methods:

  1. Probability Sampling

  2. Non-Probability Sampling

Probability Sampling methods rely on randomization to select a sample, ensuring every member of the population has a known chance of being chosen. This allows for statistical inferences about the whole group. In this article, we will focus on the Probability Sampling methods.

1. Probability Sampling

Probability sampling is a method of sampling in which each member of a population has a known, non-zero probability of being selected into the sample. The key characteristic of probability sampling methods is that they involve randomness and ensure that every element of the population has a chance to be included in the sample. If this isn't achieved, it results in a biased sample, potentially leading to inaccurate and misleading outcomes.
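The idea of a known, non-zero selection probability can be checked with a small simulation. The sketch below is illustrative only: it uses a made-up population of 20 units and the standard library, and it assumes simple random sampling without replacement, for which each unit's selection probability is n / N.

```python
import random
from collections import Counter

rng = random.Random(21)          # seeded only so the sketch is repeatable
population = list(range(20))     # hypothetical population of N = 20 units
n = 5                            # sample size
trials = 20_000                  # number of repeated draws

counts = Counter()
for _ in range(trials):
    # draw n distinct units; every unit can be picked in every draw
    counts.update(rng.sample(population, n))

# empirical inclusion frequency of each unit; theory gives n / N = 0.25
inclusion_freq = {unit: counts[unit] / trials for unit in population}
print(f"lowest:  {min(inclusion_freq.values()):.3f}")
print(f"highest: {max(inclusion_freq.values()):.3f}")
```

Every unit's frequency should land close to 0.25, which is what "known, non-zero probability" means in practice for this design.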

Probability sampling is often referred to as random sampling, because the selection relies on a random mechanism. Non-probability sampling methods, by contrast, do not select members at random, so the chance that a given member ends up in the sample is unknown.

There are four main probability sampling methods for obtaining a random sample:

  1. Simple Random

  2. Systematic

  3. Stratified

  4. Cluster

There is also a fifth method, Multistage Random Sampling, which combines two or more of the other probability sampling methods.

1.1. Probability Sampling Methods

1.1.1. Simple Random

A simple random sample is a sample in which every member of the population has an equal chance of being chosen.

  • Process: Individuals are chosen randomly and independently, often using random number generators or drawing lots.

  • Example: Selecting names from a hat or using a computer-generated random sequence.

  • Drawback: By chance, the sample may not adequately represent certain subgroups within the population, especially small ones. There's a possibility that important characteristics are not evenly distributed, leading to underrepresentation or overrepresentation of specific groups.

  • Overcoming Challenge: To ensure representation of all subgroups, we can use stratification during the sampling process. This involves dividing the population into strata based on relevant characteristics and then applying simple random sampling within each stratum.

Simple Random Sampling in Python
import pandas as pd

# import dataset (to download please go to the bottom of this page)
cars = pd.read_csv('../datasets/uk_car_sales_1996_2020.csv') 

# create a subset from the first 1000 rows
cars_sample_first_1000 = cars[:1000]

# take a random sample of the data
cars_sample_random_1000 = cars.sample(n=1000, random_state=21) # random_state ensures reproducibility of the results

# calculate means
pop_mean = round(cars['price'].mean(),2)
sample_mean = round(cars_sample_first_1000['price'].mean(),2)
sample_mean_random = round(cars_sample_random_1000['price'].mean(),2)

# display sample mean
print(f"Population Mean: {pop_mean} \n\
Sample Mean (First 1000): {sample_mean} \n\
Sample Mean (Random): {sample_mean_random}")

# Output
"""
Population Mean: 16805.42 
Sample Mean (First 1000): 23012.5 
Sample Mean (Random): 16804.94
"""

Note that the sample mean for the first 1000 rows, 23012.5, is much higher than the population mean, 16805.42. This is because that sample consists only of Audi models, which have higher selling prices than an average car. In contrast, the random sample yields almost the same mean, 16804.94, as the population mean.

When we also check the distributions of the samples, we see that the distribution of the simple random sample follows the population's distribution far more closely than that of the first-1000-rows sample.

import seaborn as sns
import matplotlib.pyplot as plt

# display histograms for 'price' column
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16,8))

sns.histplot(x='price', data=cars,bins=50, ax=axes[0])
axes[0].set_title('Population')
sns.histplot(x='price', data=cars_sample_first_1000,bins=50, color='orange', ax=axes[1])
axes[1].set_title('Sample: First 1000 Rows')
sns.histplot(x='price', data=cars_sample_random_1000,bins=50, color='g', ax=axes[2])
axes[2].set_title('Sample: Random 1000 Data Points')
plt.show()

1.1.2. Systematic

In systematic sampling, every kth member of the population is chosen from a list after a random start. The interval k is N/n, where N is the size of the population and n is the size of the sample.

  • Process: A random starting point is chosen, and then every kth individual is included in the sample.

  • Example: Selecting every 10th person from a list after randomly choosing a starting point.

  • Drawback: If there's a periodic pattern in the population, systematic sampling might lead to biased results. For example, if the list is sorted in a way that aligns with the sampling interval (kth value), the sample may not be representative.

  • Overcoming Challenge: Assess the list for periodic patterns and choose a sampling interval that does not coincide with them, or shuffle the list before sampling. A random starting point is still needed, but on its own it does not make an individual sample representative when the interval matches the period.
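The effect of a periodic pattern can be seen on a toy list. The sketch below uses made-up values in which every 7th record is high, and a sampling interval of 7 that matches that period; both the data and the variable names are assumptions for illustration, not part of the car dataset.

```python
import random
from statistics import mean

# hypothetical ordered list: every 7th record is a high value (period = 7)
values = [100 if i % 7 == 0 else 10 for i in range(700)]
true_mean = mean(values)

k = 7                                  # sampling interval equal to the period
rng = random.Random(21)                # seeded only so the sketch is repeatable
start = rng.randrange(k)               # random start in [0, k)

# systematic sample on the ordered list: captures a single phase of the pattern
aligned_sample = values[start::k]
aligned_mean = mean(aligned_sample)

# systematic sample after shuffling: the pattern no longer lines up with k
shuffled = values[:]
rng.shuffle(shuffled)
shuffled_sample = shuffled[start::k]
shuffled_mean = mean(shuffled_sample)

print(f"True mean:                  {true_mean:.2f}")
print(f"Systematic (ordered list):  {aligned_mean:.2f}")
print(f"Systematic (shuffled list): {shuffled_mean:.2f}")
```

On the ordered list the sample contains either only high or only low records, whatever the start. After shuffling, the same interval gives a mix of both.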

Systematic Sampling in Python
# import dataset
cars = pd.read_csv('../datasets/uk_car_sales_1996_2020.csv')

# define the sampling interval k = N // n
sample_size = 1000
pop_size = len(cars)
interval = pop_size // sample_size
print(interval)
# 99

# take every k-th row, starting from the first row
cars_sample_systematic = cars[::interval]

# display number of rows in the dataset
print(f"number of records: {len(cars_sample_systematic)}")
# 1002

# shuffle the dataset to avoid any patterns
cars_shuffled = cars.sample(frac=1)

# take a sample 
cars_sample_systematic_shuffled = cars_shuffled[::interval]

# re-index the sample
cars_sample_systematic_shuffled = cars_sample_systematic_shuffled.reset_index(drop=True).reset_index()

# display scatter plot
g = sns.relplot(data = cars_sample_systematic_shuffled, x='index',y='price', kind='scatter', alpha=0.5) 
plt.title('Systematic Sampling')
plt.show()

# calculate sample mean
sample_mean_systematic = round(cars_sample_systematic_shuffled['price'].mean(),2)

# display means
print(f"Population Mean: {pop_mean} \n\
Sample Mean (Systematic): {sample_mean_systematic}")

# Output
# Population Mean: 16805.42 
# Sample Mean (Systematic): 16715.31

1.1.3. Stratified

A stratified sample is acquired by dividing the population into distinct and non-overlapping (i.e. mutually exclusive) groups, known as strata, based on certain characteristics (such as age, gender, or income). Samples are then randomly selected from each of these groups.

  • Process: Random samples are drawn from each subgroup, ensuring representation from all strata.

  • Example: If studying a population of students, strata may be created based on grade levels.

  • Drawback: Identifying and accurately classifying all relevant strata can be challenging. If the classification is incorrect or if important strata are omitted, the sample may not accurately reflect the population.

  • Overcoming Challenge: Ensure accurate classification of strata by using reliable information and updated data. Conduct a thorough analysis of the population characteristics to identify relevant strata. Adequate research and understanding of the population can help address this challenge.

Stratified Sampling in Python

We will create three stratified samples using different scenarios: proportional (a random 10% of each stratum), equal count, and weighted:

# create a dataframe with top brands
cars_top_counts = cars['make'].value_counts().reset_index()

# filter brands with more than 12500 data points (the 'count' column name is produced by pandas 2.x)
cars_top_counts = cars_top_counts[cars_top_counts['count']>12500]

# extract top brands
top_brands = list(cars_top_counts['make'].values)

# sub-set cars dataframe
cars_top = cars[cars['make'].isin(top_brands)]
print(cars.shape,cars_top.shape)
# (99178, 9) (59864, 9)

# display proportions of makes/brands
cars_top['make'].value_counts(normalize=True)
# Output
"""
make
Ford          0.300047
Volkswagen    0.253157
Vauxhall      0.227683
Mercedes      0.219113
Name: proportion, dtype: float64
"""

# 1) Stratified Sampling - Proportional
# create proportional stratified sample with 10% sampling rate per make/brand
cars_sample_strat = cars_top.groupby("make").sample(frac=0.1, random_state=21)
print(cars_top.shape,cars_sample_strat.shape)
# (59864, 9) (5987, 9)

# display proportions of makes/brands
cars_sample_strat['make'].value_counts(normalize=True)

# Output
"""
make
Ford          0.299983
Volkswagen    0.253215
Vauxhall      0.227660
Mercedes      0.219141
Name: proportion, dtype: float64
"""

# 2) Stratified Sampling - Equal Count
# create equal count stratified sample with n=1000 samples from each make/brand
cars_sample_strat_eq_cnt = cars_top.groupby("make").sample(n=1000, random_state=21)
print(cars_top.shape,cars_sample_strat_eq_cnt.shape)
# (59864, 9) (4000, 9)

# display proportions of makes/brands
cars_sample_strat_eq_cnt['make'].value_counts(normalize=True)

# Output
"""
make
Ford          0.25
Mercedes      0.25
Vauxhall      0.25
Volkswagen    0.25
Name: proportion, dtype: float64
"""

# 3) Stratified Sampling - Weighted
import numpy as np

# copy the dataframe
cars_top_weighted = cars_top.copy(deep=True)

# give 'Mercedes' rows five times the weight of the others
cars_top_weighted['weight'] = np.where(cars_top_weighted['make'] == "Mercedes", 5, 1)

# draw a weighted 10% sample from the whole dataframe
# (rows are not grouped by make here; the weights alone shift the make proportions)
cars_sample_strat_weighted = cars_top_weighted.sample(frac=0.1, weights="weight",random_state=21)
print(cars_top.shape,cars_sample_strat_weighted.shape)
# (59864, 9) (5986, 10)

# display proportions of makes/brands
cars_sample_strat_weighted['make'].value_counts(normalize=True)
"""
make
Mercedes      0.556966
Ford          0.169562
Volkswagen    0.143502
Vauxhall      0.129970
Name: proportion, dtype: float64
"""

# calculate and display means
sample_mean_strat = round(cars_sample_strat['price'].mean(),2)
sample_mean_strat_eq_cnt = round(cars_sample_strat_eq_cnt['price'].mean(),2)
sample_mean_strat_weighted = round(cars_sample_strat_weighted['price'].mean(),2)

print(f"Population Mean: {pop_mean} \n\
Sample Mean (Stratified - Random with 10%): {sample_mean_strat}\n\
Sample Mean (Stratified - Equal Count): {sample_mean_strat_eq_cnt}\n\
Sample Mean (Stratified - Weighted with 10%): {sample_mean_strat_weighted}\n\
")

# Output
"""
Population Mean: 16805.42 
Sample Mean (Stratified - Random with 10%): 15732.58
Sample Mean (Stratified - Equal Count): 16102.73
Sample Mean (Stratified - Weighted with 10%): 19714.32
"""

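An equal-count sample over-represents small strata, so its plain mean generally differs from the population mean. A common remedy is to weight each stratum mean by that stratum's share of the population. The following is a minimal sketch on synthetic data, using only the standard library; the strata names, sizes, and price levels are made up for illustration and are not taken from the car dataset.

```python
import random
from statistics import mean

rng = random.Random(21)   # seeded only so the sketch is repeatable

# hypothetical strata: name -> (stratum size, typical price, spread)
strata_spec = {
    "A": (6000, 10_000, 1_000),
    "B": (3000, 20_000, 2_000),
    "C": (1000, 40_000, 4_000),
}
population = {
    name: [rng.gauss(mu, sigma) for _ in range(size)]
    for name, (size, mu, sigma) in strata_spec.items()
}

pop_size = sum(len(v) for v in population.values())
true_mean = sum(sum(v) for v in population.values()) / pop_size

# equal-count stratified sample: 200 units from every stratum
samples = {name: rng.sample(v, 200) for name, v in population.items()}

# naive mean: treats all strata as if they were equally large
naive_mean = mean([x for s in samples.values() for x in s])

# weighted mean: each stratum mean weighted by its population share
weighted_mean = sum(
    len(population[name]) / pop_size * mean(s) for name, s in samples.items()
)

print(f"Population mean:        {true_mean:.2f}")
print(f"Equal count (naive):    {naive_mean:.2f}")
print(f"Equal count (weighted): {weighted_mean:.2f}")
```

The naive mean is pulled toward the small, expensive stratum, while the weighted mean should land close to the population mean.
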
1.1.4. Cluster

A cluster sample is obtained by dividing the population into groups, or clusters, randomly selecting some of those clusters, and including every member of the chosen clusters in the final sample.

  • Process: Clusters are randomly selected, and all individuals within the chosen clusters are included.

  • Example: If studying a city's population, clusters might be neighborhoods, and random neighborhoods are selected for the study.

  • Drawback: If the selected clusters are not representative of the overall population, the sample may lack diversity. Additionally, if members of the same cluster resemble one another while the clusters differ from each other, the results are less precise than those from a simple random sample of the same size.

  • Overcoming Challenge: Check how much the clusters differ from one another by conducting a preliminary survey or assessment. If they differ substantially, increase the number of clusters selected to capture a more diverse range of characteristics. Randomly selecting clusters enhances the chances of representativeness.

Cluster Sampling in Python
import numpy as np

# create a population from all models
models_pop = list(cars['model'].unique())
print(f"Number of Models (Population): {len(models_pop)}")
#Number of Models (Population): 195

# create a random sample of models with 10% sampling rate
# (replace=False so that no model is selected twice)
models_sample = np.random.choice(models_pop, size=len(models_pop)//10, replace=False)
print(f"Number of Models (Sample): {len(models_sample)}\n")
#Number of Models (Sample): 19

# create cluster sampling
cars_sample_cluster = cars[cars['model'].isin(models_sample)]

# calculate sample mean
sample_mean_cluster = round(cars_sample_cluster['price'].mean(),2)

# display means
print(f"Population Mean: {pop_mean} \nSample Mean (Cluster): {sample_mean_cluster}")

# Output
"""
Population Mean: 16805.42 
Sample Mean (Cluster): 18690.0 # this value changes between runs because np.random.choice() is not seeded
"""

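How much a cluster sample's estimate moves from one draw to the next depends on how different the clusters are from each other. The sketch below compares repeated cluster samples with repeated simple random samples of the same size. It runs on a made-up population in which members of a cluster are similar and clusters differ, which is an assumption chosen to make the effect visible.

```python
import random
from statistics import mean, stdev

rng = random.Random(21)   # seeded only so the sketch is repeatable

# hypothetical population: 50 clusters of 20 members each;
# members of a cluster are similar, clusters differ from one another
clusters = []
for _ in range(50):
    centre = rng.gauss(100, 20)
    clusters.append([rng.gauss(centre, 2) for _ in range(20)])
everyone = [x for c in clusters for x in c]

cluster_means, srs_means = [], []
for _ in range(500):
    # cluster sample: 5 whole clusters = 100 members
    chosen = rng.sample(clusters, 5)
    cluster_means.append(mean([x for c in chosen for x in c]))
    # simple random sample of the same size
    srs_means.append(mean(rng.sample(everyone, 100)))

print(f"Spread of cluster-sample means: {stdev(cluster_means):.2f}")
print(f"Spread of simple-random means:  {stdev(srs_means):.2f}")
```

With these settings the cluster-sample means vary noticeably more, which is why selecting more clusters helps when clusters are unlike each other.
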
1.1.5. Multistage Random Sampling

The method combines multiple stages of sampling. It often involves a combination of stratified, cluster, and simple random sampling.

  • Process: Sampling occurs in several stages, with different methods applied at each stage.

  • Example: Selecting states randomly, then randomly selecting cities within chosen states, and finally randomly selecting individuals within those cities.

  • Drawback: The complexity of multistage sampling increases the chance of errors at each stage. If there are errors in any stage, they can propagate and affect the overall representativeness of the sample.

  • Overcoming Challenge: Implement thorough quality control measures at each stage of the sampling process. Regularly review and update the sampling frame to account for changes in the population. Additionally, conduct sensitivity analyses to assess the impact of potential errors at each stage.
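The states, cities, and individuals example can be sketched in a few lines. The sampling frame below is entirely hypothetical (8 states, 6 cities per state, 50 residents per city), and the numbers selected at each stage are arbitrary choices for illustration.

```python
import random

rng = random.Random(21)   # seeded only so the sketch is repeatable

# hypothetical sampling frame: state -> city -> list of resident ids
frame = {
    f"state_{s}": {
        f"city_{s}_{c}": [f"person_{s}_{c}_{p}" for p in range(50)]
        for c in range(6)
    }
    for s in range(8)
}

# stage 1: randomly select states
states = rng.sample(sorted(frame), 3)

sample = []
for state in states:
    # stage 2: randomly select cities within each chosen state
    cities = rng.sample(sorted(frame[state]), 2)
    for city in cities:
        # stage 3: randomly select individuals within each chosen city
        sample.extend(rng.sample(frame[state][city], 5))

print(states)
print(len(sample))   # 3 states x 2 cities x 5 people = 30
```

Each stage here uses simple random sampling, but any stage could use a different method, for example stratifying states by region before selecting them.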

Random sampling methods help minimize bias and increase the likelihood that the sample accurately represents the population, allowing for more robust statistical analysis and generalization of findings.

It's important to note that the drawbacks mentioned are potential challenges, and we aim to minimize these issues through careful planning and execution. The choice of a sampling method depends on the project's objectives, the nature of the population, and the available resources.

We should be aware of these limitations and take steps to address or mitigate them to ensure the validity and reliability of our findings. In general, we should prioritize transparency in our sampling methods, document the entire process, and report any limitations in our findings. Regularly reviewing and refining the sampling strategy based on ongoing data collection and analysis can help ensure the accuracy and representativeness of the sample. It's crucial to strike a balance between practical constraints and the need for a representative sample, considering the unique characteristics of the population under study.

You can access the dataset and the full script on https://github.com/sedarsahin/Sampling/tree/main
