# Confidence Interval

Before exploring confidence intervals, it's important to understand [sampling](/ds-hub/statistics/fundamentals/sampling.md) and what a sample is. A sample is a subset of data drawn from a larger population, meant to represent it as a whole. Typically, the sample is a small fraction of the total population. The goal is that by analyzing the sample we can draw conclusions and make inferences about its population.

## What is a Confidence Interval?

In simple terms, a confidence interval is a range of values that we are fairly sure contains the true value of an unknown population parameter. It has an associated confidence level, which indicates how often intervals built this way would include the true value if the sampling process were repeated. For example, a 90% confidence level means that, out of 100 repeated samples, we can expect about 90 of the resulting intervals to capture the true population parameter.

<figure><img src="/files/sQadjq1iIHbx7ibjMX9W" alt=""><figcaption><p>Credit: <a href="https://www.colorado.edu/amath/sites/default/files/attached-files/lesson7_cis1sample_0.pdf">https://www.colorado.edu/amath/sites/default/files/attached-files/lesson7_cis1sample_0.pdf</a></p></figcaption></figure>

<details>

<summary>Generating Confidence Intervals</summary>

```python
"""
In the following Python code, we simulate fair coin tosses.
In Step 1, we generate the samples:
- we toss a fair coin (p=0.5) 50 times and record the number of heads
- we then repeat this process 10 times, to get 10 samples with varying
head counts.
Keep in mind that Step 1 is optional: you can hard-code the "heads" list
printed below and proceed to Step 2 to generate the confidence intervals.

In Step 2, we generate confidence intervals from those samples.
Depending on the confidence level, some confidence intervals may
not include the true population proportion, which is 0.5 (for a fair coin).
"""
import numpy as np
import scipy.stats as st

# Step 1) Generate Head Counts
# Fix the seed for reproducibility 
np.random.seed(10)

# Confidence level
cl_90 = 0.90

# Number of trials (coin tosses per sample)
num_trials = 50

# Number of samples (repetitions of the 50-toss experiment)
sample_size = 10

# Number of successes (heads) in each sample
heads = st.binom.rvs(num_trials, p=0.5, size=sample_size)
print(heads) # [28 18 26 27 25 22 22 28 22 20]

# Step 2) Generate Confidence Intervals 
for idx, v in enumerate(heads):
    result = st.binomtest(v, num_trials)
    ci = result.proportion_ci(cl_90, method="wilson")
    ci_formatted = (round(ci[0], 2), round(ci[1], 2))
    print(f"{idx+1}. {ci_formatted}")

"""
1. (0.44, 0.67)
2. (0.26, 0.48) <-- does not include True population proportion
3. (0.41, 0.63)
4. (0.43, 0.65)
5. (0.39, 0.61)
6. (0.33, 0.56)
7. (0.33, 0.56)
8. (0.44, 0.67)
9. (0.33, 0.56)
10. (0.29, 0.52)

9 of our 10 confidence intervals (90%) include the true population proportion
"""


```

</details>
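To make the "repeated process" reading of the confidence level more concrete, the following is a minimal sketch that repeats the coin-toss experiment many times and counts how often the Wilson interval captures the true proportion. The function name `wilson_coverage`, the number of repetitions, and the seed are illustrative assumptions, not part of the example above.

```python
import numpy as np
import scipy.stats as st


def wilson_coverage(p_true=0.5, num_trials=50, num_samples=1000,
                    confidence_level=0.90, seed=0):
    """Fraction of simulated Wilson intervals that contain p_true.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # One head count per simulated sample of `num_trials` tosses
    heads = rng.binomial(num_trials, p_true, size=num_samples)

    covered = 0
    for k in heads:
        ci = st.binomtest(int(k), num_trials).proportion_ci(
            confidence_level, method="wilson"
        )
        # Count the interval if it captures the true proportion
        covered += ci.low <= p_true <= ci.high
    return covered / num_samples


coverage = wilson_coverage()
print(f"Empirical coverage: {coverage:.3f}")
```

The empirical coverage should land close to the nominal 90%, though not exactly on it: head counts are discrete, so the achievable coverage moves in small jumps rather than continuously.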

## 1. Calculating Confidence Intervals

### 1.1. Mean

For means, we take the sample mean, then add and subtract a margin of error: the critical value for our confidence level multiplied by the standard deviation over the square root of the sample size. The critical value is a z-score when σ is known or the sample size is large (n ≥ 30), and a t-score when σ is unknown and the sample is small.

#### When σ is Known

<figure><img src="/files/bUIpxYkOPcjYo8xX6o4Y" alt="" width="188"><figcaption></figcaption></figure>

The equation simply tells us that the confidence interval is centered at the sample mean (x̄) and extends 1.96 standard errors (σ/√n) to each side, where 1.96 is the z-score for a 95% confidence level.
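The known-σ case can be sketched in a few lines of Python. The sample values, the assumed population standard deviation `sigma`, and the helper name `z_interval_known_sigma` below are all hypothetical and chosen only for illustration.

```python
import numpy as np
import scipy.stats as st


def z_interval_known_sigma(sample, sigma, confidence_level=0.95):
    """z-based confidence interval for the mean when sigma is known."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    # Two-sided critical z-score (about 1.96 for a 95% confidence level)
    z_crit = st.norm.ppf(1 - (1 - confidence_level) / 2)
    margin = z_crit * sigma / np.sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical measurements; sigma is assumed to be known in advance
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0]
sigma = 0.25

ci_known = z_interval_known_sigma(sample, sigma)
print(ci_known)
```

The same interval can be obtained with `st.norm.interval(0.95, loc=x_bar, scale=sigma / np.sqrt(n))`, which is a convenient cross-check.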

#### When σ is Unknown and Sample Size n ≥ 30:

We first calculate the sample standard deviation:

<figure><img src="/files/QtQjnW56vocUVG55l81T" alt="" width="188"><figcaption></figcaption></figure>

Then, compute the confidence interval:

<figure><img src="/files/KheUhU374TiRNkOjA7gd" alt="" width="165"><figcaption></figcaption></figure>
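As a rough sketch of the large-sample case, the snippet below estimates σ with the sample standard deviation and builds a z-based interval. The simulated data (40 draws from a normal distribution) and the seed are assumptions made purely for illustration. The t-based interval is computed alongside it for comparison.

```python
import numpy as np
import scipy.stats as st

# Hypothetical data: 40 simulated observations (n >= 30)
rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=40)

n = sample.size
x_bar = sample.mean()
# Sample standard deviation (ddof=1 applies Bessel's correction)
s = sample.std(ddof=1)
# Standard error of the mean
se = s / np.sqrt(n)

alpha = 0.05  # 95% confidence level
z_crit = st.norm.ppf(1 - alpha / 2)

# z-based interval using s in place of the unknown sigma
ci_z = (x_bar - z_crit * se, x_bar + z_crit * se)

# t-based interval on the same data, for comparison
ci_t = st.t.interval(1 - alpha, n - 1, loc=x_bar, scale=se)

print("z-based CI:", ci_z)
print("t-based CI:", ci_t)
```

The t-based interval is always slightly wider than the z-based one, and the gap shrinks as n grows, which is why the z approximation is commonly accepted for larger samples.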

#### When σ is Unknown and Sample Size n < 30:

We rely on *Student's t-distribution*:

<figure><img src="/files/fCdvUB0CX0Enh4BIqcut" alt="" width="188"><figcaption></figcaption></figure>

<details>

<summary>Example in Python</summary>

```python
import scipy.stats as st
import numpy as np

# Method 1 - Manual
# Sample data with n=10 (1 to 10)
n = 10
a = range(1,n+1)

# Mean
m = np.mean(a)
# Standard deviation
s = np.std(a,ddof=1)

# Critical t-score for alpha = 0.05 (the sample size is 10, so dof = 9)
alpha = 0.05
dof = n-1
t_crit = st.t.ppf(1-alpha/2, dof)

# Confidence interval
ci_manual = (m - t_crit * s / np.sqrt(n), m + t_crit * s / np.sqrt(n)) # s / np.sqrt(n) is called the standard error of the mean
print(ci_manual)
# (3.3341494102783162, 7.665850589721684)

# Method 2 - Using scipy's interval method
ci_scipy = st.t.interval(1-alpha, dof, loc=m, scale = st.sem(a))
print(ci_scipy)
# (3.3341494102783162, 7.665850589721684)


```

</details>

### 1.2. Proportions

For proportions, we take the sample proportion, then add and subtract the z-score times the standard error of the proportion: the square root of the sample proportion times its complement, divided by the sample size.

<figure><img src="/files/6VU1OfgKfpOICTrteinD" alt="" width="188"><figcaption></figcaption></figure>

<details>

<summary>Example in Python</summary>

```python
import numpy as np
import scipy.stats as st
from statsmodels.stats.proportion import proportion_confint


# Sample data
n = 100  # Sample size
x = 60   # Number of successes (favorable responses)
p_hat = x / n  # Sample proportion

# Confidence level
alpha = 0.05  # 95% confidence level

# Standard error for proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Critical z-score for 95% confidence level
z_crit = st.norm.ppf(1 - alpha / 2)

# Method 1 - Manual
ci_manual = (p_hat - z_crit * se, p_hat + z_crit * se)

# Method 2 - Using scipy's interval method
ci_scipy = st.norm.interval(1 - alpha, loc=p_hat, scale=se)

# Method 3 - Using statsmodels' proportion_confint method
ci_statsmodels = proportion_confint(x, n, alpha)

print("Manual CI:", ci_manual)
print("Scipy CI:", ci_scipy)
print("Statsmodels CI:", ci_statsmodels)
# Manual CI:      (0.5039817664728938, 0.6960182335271061)
# Scipy CI:       (0.5039817664728938, 0.6960182335271061)
# Statsmodels CI: (0.5039817664728937, 0.6960182335271062)
```

</details>


