Evaluation

Clustering is a widely used technique in machine learning to group data into subsets (clusters) based on similarities. Once clustering is performed, evaluating its performance is crucial, especially when you don't have labels to guide you. In this post, we will dive into four popular techniques for evaluating clustering results: the Elbow Method, Silhouette Score, Davies-Bouldin Index, and PCA Visualization. Each of these methods helps us assess the quality of clustering and gain insights into the data.

1. The Elbow Method: Finding the Optimal Number of Clusters

The Elbow Method is a heuristic used to determine the optimal number of clusters in a dataset when using algorithms like K-means.

How it Works:

  1. Fit the Clustering Model: For each number of clusters k (e.g., 1, 2, 3, …), apply the clustering algorithm (e.g., K-means) to your dataset.

  2. Calculate the WCSS (Within-Cluster Sum of Squares): For each k, compute the sum of squared distances from each point to its assigned centroid.

    The formula for WCSS is:

WCSS = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2

Where:

  • x_i is a data point,

  • C_j is the set of points assigned to cluster j,

  • c_j is the centroid of cluster j,

  • k is the number of clusters.

  3. Plot WCSS against the Number of Clusters k: The graph will typically show a steep decrease in WCSS at first, which flattens out as k increases. The "elbow" point, where the curve starts to flatten, indicates the optimal number of clusters (see the sketch below).
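
Below is a minimal sketch of these three steps using scikit-learn's KMeans. The make_blobs toy dataset is an assumption for illustration, and inertia_ is scikit-learn's name for the WCSS:

```python
# Minimal Elbow Method sketch (toy data; an illustrative assumption).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # assumed toy dataset

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
```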

Interpreting the Elbow:

  • Optimal k: The "elbow" is where increasing the number of clusters doesn't substantially reduce the WCSS. This point marks the ideal number of clusters.

Advantages:

  • Simple to implement.

  • Visual method for selecting the optimal number of clusters.

Limitations:

  • The "elbow" is sometimes ambiguous, and there might not be a clear point where the curve flattens.


2. Silhouette Score: Measuring Cluster Cohesion and Separation

The Silhouette Score is a measure of how well-separated the clusters are. It evaluates how similar each point is to its own cluster compared to other clusters.

How it Works:

For each data point, the silhouette score is calculated using two values:

  • Cohesion (a(i)): The average distance between point i and all other points in the same cluster.

  • Separation (b(i)): The average distance between point i and all points in the nearest neighboring cluster.

The silhouette score for a point i is calculated as:

S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where:

  • a(i) is the average distance from point i to all points in the same cluster.

  • b(i) is the average distance from point i to the points in the nearest cluster.
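
As a quick illustration (a sketch, not from the original post), scikit-learn exposes both the mean score and the per-point values; the toy data and KMeans fit here are assumptions:

```python
# Minimal Silhouette Score sketch with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # assumed toy dataset
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("Mean silhouette score:", silhouette_score(X, labels))  # overall quality
s = silhouette_samples(X, labels)                             # S(i) per point
print("Points with negative S(i):", (s < 0).sum())            # likely misclassified
```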

Interpreting the Silhouette Score:

  • The score ranges from -1 to +1:

    • +1: Indicates that the point is well-clustered (far from neighboring clusters).

    • 0: Indicates that the point is on or near the boundary between two clusters.

    • -1: Indicates that the point is misclassified (closer to points in a different cluster).

Advantages:

  • Gives an overall sense of the quality of clustering.

  • Can highlight misclassified points.

Limitations:

  • Computationally expensive for large datasets.

  • Sensitive to the choice of distance metric.


3. Davies-Bouldin Index: Evaluating Cluster Compactness and Separation

The Davies-Bouldin Index (DBI) is another method to evaluate the quality of clusters by considering both their compactness and separation.

How it Works:

The Davies-Bouldin Index is calculated by evaluating each cluster pair i and j using two factors:

  1. Compactness (scatter) S_i: The average distance between points in cluster i and its centroid.

  2. Separation d(i,j): The distance between the centroids of clusters i and j.

The formula for DBI is:

DBI = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \left( \frac{S_i + S_j}{d(i, j)} \right)

Where:

  • S_i is the compactness of cluster i,

  • d(i,j) is the distance between the centroids of clusters i and j,

  • N is the total number of clusters.
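
A minimal sketch of computing the index with scikit-learn's davies_bouldin_score (the toy data and KMeans fit are assumptions, not part of the original post):

```python
# Minimal Davies-Bouldin Index sketch with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # assumed toy dataset
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))  # lower is better
```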

Interpreting the Davies-Bouldin Index:

  • The lower the DBI, the better the clustering, as it indicates that the clusters are compact and well-separated.

  • Ideal DBI: A DBI close to 0 is ideal, as it represents distinct, well-separated clusters.

Advantages:

  • Simple to calculate and interpret.

  • Penalizes poor separation and compactness of clusters.

Limitations:

  • Assumes spherical clusters of similar size, an assumption that may not hold for all datasets.

  • May not perform well with clusters of differing shapes.


4. PCA Visualization: Reducing Dimensions to Visualize Clusters

When working with high-dimensional data, visualizing the clusters can be challenging. Principal Component Analysis (PCA) is a technique that reduces the number of dimensions in a dataset while retaining as much variance as possible.

How it Works:

PCA transforms the data into a new coordinate system, where each axis (principal component) represents a direction of maximum variance. The first few components usually capture most of the variance in the data, which can be visualized in 2D or 3D.

The steps for using PCA for clustering visualization are:

  1. Fit PCA: Apply PCA to reduce the data’s dimensions to 2 or 3 principal components.

  2. Plot the Reduced Data: Once reduced, the data can be plotted in 2D or 3D, with points colored by their cluster assignments (as in the sketch below).
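
The sketch below puts these two steps together with scikit-learn's PCA; the 10-feature toy dataset and the KMeans labels are assumptions for illustration:

```python
# Minimal PCA visualization sketch (toy data; an illustrative assumption).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)   # step 1: reduce to 2 components

# Step 2: plot the reduced data, colored by cluster assignment.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Clusters in PCA space")
plt.show()
```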

Why PCA is Useful for Clustering:

  • Dimensionality Reduction: PCA helps simplify the visualization of high-dimensional data.

  • Cluster Separation: After applying PCA, you can visually inspect how well-separated the clusters are in the reduced space.

Mathematical Formula for PCA:

PCA works by finding the eigenvectors and eigenvalues of the covariance matrix C of the mean-centered data:

C = \frac{1}{n-1} X^T X

Where:

  • X is the mean-centered data matrix, with n observations as rows.

  • C is the covariance matrix.

PCA then selects the top eigenvectors corresponding to the largest eigenvalues to form the new coordinate system.
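
To make this concrete, here is a small NumPy sketch (an illustration, assuming a random toy matrix) that builds C, extracts the top eigenvectors, and projects the data onto them:

```python
# Minimal PCA-from-scratch sketch via eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))          # toy data: 200 observations, 5 features
Xc = X - X.mean(axis=0)                # mean-center before forming C

C = (Xc.T @ Xc) / (Xc.shape[0] - 1)    # covariance matrix C = X^T X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)   # eigh handles symmetric matrices

order = np.argsort(eigvals)[::-1]      # sort eigenvalues in descending order
W = eigvecs[:, order[:2]]              # top-2 principal directions
X_2d = Xc @ W                          # coordinates in the new system
```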

Advantages:

  • Effective for visualizing high-dimensional data.

  • Helps you quickly check the separation of clusters in 2D/3D space.

Limitations:

  • PCA does not always find the best clustering separation, especially if the clusters are not linearly separable.

  • Information may be lost during dimensionality reduction.
