Confidence Interval


Before exploring confidence intervals, it's important to understand what a sample is. A sample is a subset of data drawn from a larger population, meant to represent it as a whole. Typically, the sample is a small fraction of the total population. The goal is that, by analyzing the sample, we can draw conclusions and make inferences about the population it comes from.
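As a quick illustration of this idea, here is a minimal sketch (the population values, sizes, and seed below are made up purely for demonstration): we simulate a large population, draw a small random sample from it, and see that the sample mean approximates the population mean.

import numpy as np

# Hypothetical population: 100,000 values from a normal distribution
rng = np.random.default_rng(42)
population = rng.normal(loc=170, scale=10, size=100_000)

# A sample is a small, randomly chosen subset of the population
sample = rng.choice(population, size=100, replace=False)

# The sample statistic approximates the population parameter
print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")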

What is a Confidence Interval?

In simple terms, a Confidence Interval represents a range of values that we are fairly sure contains the true value of an unknown population parameter. It has an associated confidence level, which indicates how often the interval would capture this value if the process were repeated. For example, a 90% confidence level means that in 90 out of 100 cases, we can expect the interval to capture the true population parameter.

Generating Confidence Intervals
"""
In the following Python Code, we will be resembling a fair coin toss.
In Step 1, we will generate the samples:
- we till toss a fair coin (p=0.5) 50 times record number of head counts
- we will then repeat this process 10 times, to have 10 samples with varying
number of head counts.
Keep in mind that Step 1 is optional, and you can use "heads" list and proceed
to Step 2 to generate the confidence intervals.

In Step 2, we will be generating Confidence Intervals using those samples. 
Depending on the confidence level, we may find confidence intervals that do 
not include the True Population Proportion, which 0.5 (for a fair toss coin).
"""
import numpy as np
import scipy.stats as st

# Step 1) Generate Head Counts
# Fix the seed for reproducibility 
np.random.seed(10)

# Confidence level
cl_90 = 0.90

# Number of trials
num_trials = 50

# Sample size
sample_size = 10

# Number of successes, i.e. heads
heads = st.binom.rvs(num_trials, p=0.5, size=sample_size)
print(heads) # [28 18 26 27 25 22 22 28 22 20]

# Step 2) Generate Confidence Intervals 
for idx, v in enumerate(heads):
    result = st.binomtest(v, num_trials)
    ci = result.proportion_ci(cl_90, method="wilson")
    ci_formatted = (round(ci[0], 2), round(ci[1], 2))
    print(f"{idx+1}. {ci_formatted}")

"""
1. (0.44, 0.67)
2. (0.26, 0.48) <-- does not include True population proportion
3. (0.41, 0.63)
4. (0.43, 0.65)
5. (0.39, 0.61)
6. (0.33, 0.56)
7. (0.33, 0.56)
8. (0.44, 0.67)
9. (0.33, 0.56)
10. (0.29, 0.52)

9/10 (90%) of our confidence intervals include the true population proportion of 0.5
"""

1. Calculating Confidence Intervals

1.1. Mean

For means, we take the sample mean and then add and subtract the appropriate critical score for our confidence level (a z-score when σ is known or the sample size is large, n > 30; a t-score when σ is unknown and the sample is small) multiplied by the standard deviation divided by the square root of the sample size.

When σ is Known

The equation simply tells us that the confidence interval is centered at the sample mean $\bar{x}$ and extends $z_{\alpha/2} \cdot \sigma/\sqrt{n}$ to each side of $\bar{x}$, where $z_{\alpha/2} \approx 1.96$ for a 95% confidence level:

$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
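A minimal sketch of this case, assuming the population standard deviation is known (the sample values and sigma below are made up for illustration):

import numpy as np
import scipy.stats as st

# Hypothetical sample with a known population standard deviation (sigma = 3)
a = [12, 15, 11, 14, 16, 13, 12, 15, 14, 13]
sigma = 3
n = len(a)
m = np.mean(a)

# Critical z-score for a 95% confidence level
alpha = 0.05
z_crit = st.norm.ppf(1 - alpha / 2)

# Confidence interval: x_bar +/- z * sigma / sqrt(n)
ci_known_sigma = (m - z_crit * sigma / np.sqrt(n), m + z_crit * sigma / np.sqrt(n))
print(ci_known_sigma)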

When σ is Unknown and Sample Size n ≥ 30:

We first calculate the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

Then, compute the confidence interval, using $s$ in place of $\sigma$:

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
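A minimal sketch of the large-sample case (the data below are simulated for illustration only):

import numpy as np
import scipy.stats as st

# Simulated sample with n >= 30
np.random.seed(0)
a = np.random.normal(loc=50, scale=8, size=40)

n = len(a)
m = np.mean(a)
s = np.std(a, ddof=1)  # sample standard deviation

# Critical z-score for a 95% confidence level
alpha = 0.05
z_crit = st.norm.ppf(1 - alpha / 2)

# Confidence interval: x_bar +/- z * s / sqrt(n)
ci_large_n = (m - z_crit * s / np.sqrt(n), m + z_crit * s / np.sqrt(n))
print(ci_large_n)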

When σ is Unknown and Sample Size n < 30:

We rely on Student's t-distribution with $n-1$ degrees of freedom:

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

Example in Python
import scipy.stats as st
import numpy as np

# Method 1 - Manual
# Sample data with n=10 (1 to 10)
n = 10
a = range(1,n+1)

# Mean
m = np.mean(a)
# Standard deviation
s = np.std(a,ddof=1)

# Critical t-score for a 95% confidence level (alpha = 0.05) with n-1 = 9 degrees of freedom
alpha = 0.05
dof = n-1
t_crit = st.t.ppf(1-alpha/2, dof)

# Confidence interval
ci_manual = (m - t_crit * s / np.sqrt(n), m + t_crit * s / np.sqrt(n)) # s / np.sqrt(n) is called the standard error of the mean
print(ci_manual)
# (3.3341494102783162, 7.665850589721684)

# Method 2 - Using scipy's interval method
ci_scipy = st.t.interval(1-alpha, dof, loc=m, scale = st.sem(a))
print(ci_scipy)
# (3.3341494102783162, 7.665850589721684)

1.2. Proportions

For proportions, we take the sample proportion and add and subtract the z-score times the square root of the sample proportion times its complement, divided by the sample size:

$$\hat{p} \pm z_{\alpha/2}\,\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Example in Python
import numpy as np
import scipy.stats as st
from statsmodels.stats.proportion import proportion_confint


# Sample data
n = 100  # Sample size
x = 60   # Number of successes (favorable responses)
p_hat = x / n  # Sample proportion

# Confidence level
alpha = 0.05  # 95% confidence level

# Standard error for proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Critical z-score for 95% confidence level
z_crit = st.norm.ppf(1 - alpha / 2)

# Method 1 - Manual
ci_manual = (p_hat - z_crit * se, p_hat + z_crit * se)

# Method 2 - Using scipy's interval method
ci_scipy = st.norm.interval(1 - alpha, loc=p_hat, scale=se)

# Method 3 - Using statsmodels' proportion_confint method
ci_statsmodels = proportion_confint(x, n, alpha)

print("Manual CI:", ci_manual)
print("Scipy CI:", ci_scipy)
print("Statsmodels CI:", ci_statsmodels)
# Manual CI:      (0.5039817664728938, 0.6960182335271061)
# Scipy CI:       (0.5039817664728938, 0.6960182335271061)
# Statsmodels CI: (0.5039817664728937, 0.6960182335271062)

Credit:
https://www.colorado.edu/amath/sites/default/files/attached-files/lesson7_cis1sample_0.pdf