Exponential Distribution

Last updated 1 year ago


The Exponential Distribution is a continuous distribution that models the time between events in a Poisson process, such as the time between website visits or customer arrivals. It is defined by a single parameter, lambda (λ), which represents the rate at which events occur. The Exponential Distribution is memoryless: the time until the next event does not depend on how long it has been since the last event.

In the following example we will cover how the Exponential Distribution can be used by an online e-commerce company and answer the following questions:

  1. How often do we get orders?

  2. Does the dataset actually follow an Exponential Distribution?

  3. Is the lambda value we chose really the best value to define the distribution?

The dataset is loaded in the code below; the full Jupyter notebook is linked at the end of this page.

1. How often do we get orders?

The number of visits between orders for an e-commerce website in a certain time period in January 2024 is stored in the array order_visits.

If we assume that website visits resulting in orders can be modeled using a Poisson distribution, then the time between these visits (also known as the interval time) follows an Exponential distribution.

The Poisson distribution is a suitable model for counting the number of events (orders) occurring in a fixed interval (e.g., website visits) if the following conditions are met:

  1. Events are independent: Each website visit is an independent event, and the occurrence of an order does not affect the probability of another order.

  2. Events occur at a constant rate: The rate at which orders occur is constant over the fixed interval (website visits).

  3. Events are rare: The probability of an order occurring in a single website visit is relatively low (less than 10-15%)
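Under these conditions, the Poisson and Exponential distributions are two views of the same process: if the gaps between orders are Exponential with some rate, then the number of orders in a fixed window is Poisson. A minimal simulation sketch of this link (the rate and window size are illustrative, not from the dataset):

```python
import numpy as np

rng = np.random.default_rng(21)
rate = 0.1  # illustrative: one order per 10 visits on average

# simulate Exponential gaps (visits between orders) and place orders on a "visit" axis
gaps = rng.exponential(scale=1 / rate, size=100_000)
order_positions = np.cumsum(gaps)

# count orders in fixed windows of 100 visits each
window = 100
n_windows = int(order_positions[-1] // window)
bins = np.arange(0, (n_windows + 1) * window, window)
counts = np.histogram(order_positions, bins=bins)[0]

# if gaps are Exponential(rate), counts per window should be Poisson(rate * window)
print(counts.mean())  # close to rate * window = 10
print(counts.var())   # Poisson => variance close to the mean
```

A quick check that the mean and variance of the window counts agree is a simple diagnostic for Poisson-ness.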

The Exponential distribution is a continuous distribution that describes the time between events, i.e. the interval time, in a Poisson process. It has a single parameter, λ (lambda), a.k.a. the rate parameter, which represents the rate at which events occur.

The Exponential distribution has several important properties, including:

  • Memorylessness: the time between events does not depend on the time since the last event

  • Constant rate: the rate at which events occur is constant over time

  • Exponential decay: the probability of waiting for a certain amount of time before the next event decays exponentially
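The memorylessness property can be checked numerically: for an Exponential random variable X, P(X > s + t | X > s) equals P(X > t). A minimal sketch with simulated samples (the scale and the values of s and t are illustrative):

```python
import numpy as np

rng = np.random.default_rng(21)
samples = rng.exponential(scale=5.0, size=1_000_000)  # illustrative scale

s, t = 4.0, 3.0
# memorylessness: P(X > s + t | X > s) should equal P(X > t)
p_cond = np.mean(samples[samples > s] > s + t)
p_marg = np.mean(samples > t)
print(p_cond, p_marg)  # the two probabilities should be nearly identical
```

Having already waited s units does not change the distribution of the remaining wait, which is exactly why the "time since the last event" can be ignored.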

Here we take λ to be the average number of visits between orders, computed directly from the dataset (order_visits). Strictly speaking, this mean is the scale parameter (1/rate), which is exactly what NumPy's np.random.exponential expects as its scale argument, so using the sample mean makes the Exponential distribution best fit the data.

import numpy as np
import matplotlib.pyplot as plt


f ='./datasets/order_visits.csv'
order_visits = np.loadtxt(f, delimiter=',', dtype='int')

# set the seed for reproducibility
np.random.seed(21)

# compute lambda, mean time (in units of number of visits) between orders.
lambda_ = np.mean(order_visits)

# draw 50,000 samples from an exponential distribution with the rate parameter lambda_
interval_time_btw_orders = np.random.exponential(scale = lambda_, size = 50000)

# plot the theoretical PDF along with the axes' label
plt.hist(interval_time_btw_orders, bins= 50, color='r',histtype='step', density=True)
plt.xlabel('Visits between orders')
plt.ylabel('PDF')

# show the plot
plt.show()

Note the shape of the distribution of the samples: it peaks at 0 and decays as we move away from the origin, which is characteristic of the Exponential distribution.

2. Does the dataset actually follow an Exponential Distribution?

To answer this question we will create an empirical cumulative distribution function (eCDF) of the real data and compare it with the theoretical CDF (tCDF). If the two overlap, we can conclude that the Exponential distribution describes the observed data.

To draw this conclusion we will follow these steps:

  1. Create an eCDF function, and use it to compute the CDF from the actual dataset (order_visits).

  2. Use the same function to compute the tCDF from the theoretical samples (interval_time_btw_orders).

  3. Plot x_theor and y_theor, then overlay the eCDF of the real data (x and y) as points.

The empirical CDF is usually defined as

CDF(x) = (Number of Samples <= x) / Number of Samples
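As a quick sanity check of this definition on a tiny, made-up array:

```python
import numpy as np

data = np.array([2, 1, 3, 2])
x = np.sort(data)                      # [1, 2, 2, 3]
y = np.arange(1, len(x) + 1) / len(x)  # [0.25, 0.5, 0.75, 1.0]
# e.g. CDF(2) = (number of samples <= 2) / 4 = 3/4
print(dict(zip(x.tolist(), y.tolist())))
```

For repeated values, the largest y associated with a value is its CDF, which is why sorting and dividing the rank by n is all that is needed.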

# The empirical CDF
def ecdf(data):
    """Computes CDF for a one-dimensional array."""
    n = len(data)    # Number of data points
    x = sorted(data) # x-data for the CDF
    y = np.arange(1, n+1)/n   # y-data for the CDF
    return x,y
# compute lambda
lambda_ = np.mean(order_visits)

# create an exponential distribution with parameter lambda and 50000 samples
interval_time_btw_orders = np.random.exponential(lambda_,50000)

# create an eCDF from real data
x, y = ecdf(order_visits)

# create a tCDF from theoretical data
x_theor, y_theor = ecdf(interval_time_btw_orders)

# display theoretical values
plt.plot(x_theor,y_theor, color = 'black',label='theoretical')
# display data values
plt.scatter(x, y, facecolors='none', edgecolors='r', label='datapoints')

# define labels and display legend
plt.xlabel('Visits Between Orders')
plt.ylabel('CDF')
plt.legend()        

# plot the graph
plt.show()

It looks like the number of visits between orders (order_visits) is indeed Exponentially distributed, since the dataset closely follows the theoretical cumulative distribution.
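The visual overlap can also be quantified: a common summary is the largest vertical gap between the empirical and theoretical CDFs (the Kolmogorov–Smirnov statistic), where small values indicate a close fit. A minimal sketch, using synthetic Exponential data as a stand-in for order_visits (the scale and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(21)

# stand-in for order_visits: synthetic Exponential data (illustrative only)
order_visits = rng.exponential(scale=8.0, size=2_000)

lambda_ = np.mean(order_visits)

# empirical CDF at the sorted data points
x = np.sort(order_visits)
y_emp = np.arange(1, len(x) + 1) / len(x)

# theoretical Exponential CDF with scale lambda_: F(x) = 1 - exp(-x / lambda_)
y_theo = 1 - np.exp(-x / lambda_)

# Kolmogorov-Smirnov-style statistic: largest gap between the two curves
D = np.max(np.abs(y_emp - y_theo))
print(D)  # small values indicate a close fit
```

In practice one would compare D against a critical value (or use a packaged test such as scipy.stats.kstest) rather than eyeballing it.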

3. Is the lambda value we chose really the best value to define the distribution?

Let's test this with two different values for lambda:

  1. 3 times the lambda

  2. 1/3 of the lambda

We will draw new samples using these values and pass them to our eCDF function, then plot them along with the actual data and the theoretical samples obtained using the original lambda.

# recall lambda = np.mean(order_visits)
lambda_3x = lambda_*3
lambda_03 = lambda_*(1/3)

# draw 50000 samples out of an Exponential distribution with rate_parameter lambda_3x and rate_parameter lambda_03
interval_time_btw_orders_3x = np.random.exponential(lambda_3x,50000)
interval_time_btw_orders_03 = np.random.exponential(lambda_03,50000)


# Create a tCDFs from the two samples 
x_theor3x, y_theor3x = ecdf(interval_time_btw_orders_3x)
x_theor03, y_theor03 = ecdf(interval_time_btw_orders_03)


# display real data
plt.scatter(x,y, marker = 'o', facecolors='none', edgecolors='red', label='datapoints')

# display all theoretical samples
plt.plot(x_theor, y_theor,color='black', label='lambda')
plt.plot(x_theor3x, y_theor3x, color='orange',linestyle='-.', label='lambda_3x')
plt.plot(x_theor03, y_theor03, color='green',linestyle='--',label='lambda_1/3')

# labels and legend
plt.xlabel('Visits between orders')
plt.ylabel('CDF')
plt.legend()
plt.show()

Note that only the mean value of order_visits, i.e. lambda, fits the data almost perfectly. Therefore we can conclude that the mean of order_visits is the best value for the parameter lambda.
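This conclusion is no accident: for the Exponential distribution, the sample mean is the maximum-likelihood estimate of the scale parameter. The log-likelihood comparison below sketches this, again using synthetic data as a stand-in for order_visits (scale and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(21)

# stand-in for order_visits (illustrative synthetic data)
order_visits = rng.exponential(scale=8.0, size=2_000)

def exp_log_likelihood(data, scale):
    """Log-likelihood of Exponential(scale): sum of log((1/scale) * exp(-x/scale))."""
    return np.sum(-np.log(scale) - data / scale)

lambda_ = np.mean(order_visits)

# the sample mean should beat both 3x and 1/3 of itself
for s in (lambda_ / 3, lambda_, lambda_ * 3):
    print(round(s, 2), round(exp_log_likelihood(order_visits, s), 1))
```

The middle value (the sample mean) attains the highest log-likelihood, matching what the CDF plot shows visually.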

The full Jupyter notebook: https://github.com/sedarsahin/Distributions/blob/main/Exponential/interval_between_orders.ipynb