Data Science Hub

Encoding

Encoding is the process of transforming categorical data into a numerical format that machine learning algorithms can interpret. Most algorithms work with numerical inputs, so encoding is essential when working with categorical variables. There are two common methods for encoding:

1. Label Encoding:

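  • Assigns each unique category an integer in the range [0, n_categories - 1].

    • Example: ["Red", "Green", "Blue"] → [2, 1, 0] with scikit-learn's LabelEncoder, which numbers categories in sorted (alphabetical) order: Blue = 0, Green = 1, Red = 2.

  • Suitable for ordinal data (e.g., "Low", "Medium", "High"), provided the integers follow the real order. LabelEncoder's alphabetical numbering gives High = 0, Low = 1, Medium = 2, so an explicit mapping (e.g., OrdinalEncoder with a categories list) is needed to preserve the order.

  • Simple and memory-efficient.

  • May introduce unintended ordinal relationships in nominal (unordered) data, which can hurt algorithms that treat the integers as magnitudes or distances, such as Linear Regression or K-means.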
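
A minimal sketch of the ordinal case, assuming scikit-learn's LabelEncoder and OrdinalEncoder. The "Level" column and its values are made up for illustration. LabelEncoder numbers classes in sorted order, so it does not follow the Low < Medium < High ranking unless the order is supplied explicitly:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

# Hypothetical ordinal feature (illustrative data only)
levels = pd.DataFrame({"Level": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts classes alphabetically: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(levels["Level"])

# OrdinalEncoder with an explicit category order: Low=0, Medium=1, High=2
ordinal_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered = ordinal_encoder.fit_transform(levels[["Level"]]).ravel()

print(alphabetical.tolist())  # [1, 0, 2, 1]
print(ordered.tolist())       # [0.0, 2.0, 1.0, 0.0]
```

Only the second encoding keeps the ranking, which is what a model needs if it is going to compare the integers.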

2. One-Hot Encoding:

  • Converts each unique category into a separate binary column (also known as a "dummy variable"), indicating presence (1) or absence (0).

    • Example: ["Red", "Green", "Blue"] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

  • Suitable for nominal data, i.e. data with no inherent order (e.g., "Red", "Blue", "Green").

  • Prevents introducing ordinal relationships into non-ordinal data.

  • Works well with many machine learning models, especially those that would otherwise read integer codes as magnitudes or distances (e.g., linear regression, neural networks, clustering).

  • Increases dimensionality, significantly so when the categorical variable has many unique values.

  • Can lead to a sparse dataset, since each row holds a single 1 among the columns created for a feature.
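
The dimensionality and sparsity points can be illustrated with a rough sketch. It assumes scikit-learn's OneHotEncoder with its default sparse output. The "City" column, the 500 distinct values, and the unseen value are hypothetical, chosen only to show the effect:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows, 500 distinct values
cities = pd.DataFrame({"City": [f"city_{i % 500}" for i in range(1000)]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(cities[["City"]])  # SciPy sparse matrix

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)  # share of non-zero cells

print(encoded.shape)  # (1000, 500)
print(density)        # 0.002

# A category that was not present at fit time becomes an all-zero row
unseen = encoder.transform(pd.DataFrame({"City": ["city_9999"]}))
print(unseen.toarray().sum())  # 0.0
```

One column becomes 500, and only 0.2% of the cells are non-zero, which is why the encoder keeps the result in a sparse matrix by default.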

Example Code for Both in Scikit-learn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Label Encoding
label_encoder = LabelEncoder()
df['Color_Label'] = label_encoder.fit_transform(df['Color'])

# One-Hot Encoding
# OneHotEncoder returns a SciPy sparse matrix by default, so convert it to a
# dense array before building the DataFrame.
onehot_encoder = OneHotEncoder()  # pass drop="first" to omit the first category's column
color_onehot = onehot_encoder.fit_transform(df[['Color']]).toarray()
df_onehot = pd.DataFrame(color_onehot, columns=onehot_encoder.get_feature_names_out(['Color']))

# Combine original and one-hot encoded data
df_combined = pd.concat([df, df_onehot], axis=1)

print("Label Encoded Data:")
print(df)

print("\nOne-Hot Encoded Data:")
print(df_combined)


"""
OUTPUT:

Label Encoded Data:
   Color  Color_Label
0    Red            2
1  Green            1
2   Blue            0
3    Red            2

One-Hot Encoded Data:
   Color  Color_Label  Color_Blue  Color_Green  Color_Red
0    Red            2         0.0          0.0        1.0
1  Green            1         0.0          1.0        0.0
2   Blue            0         1.0          0.0        0.0
3    Red            2         0.0          0.0        1.0

"""

3. Key Differences

| Feature              | Label Encoding                   | One-Hot Encoding                           |
|----------------------|----------------------------------|--------------------------------------------|
| Output               | Single integer column            | Multiple binary columns (one per category) |
| Type of Data         | Ordinal or nominal               | Nominal only                               |
| Ordinal Relationship | May impose unintended order      | No ordinal relationship implied            |
| Dimensionality       | Low (1 column per feature)       | High (one column for each category)        |
| Use Cases            | Decision trees, ordinal features | Linear regression, neural networks         |

4. Why Encoding Matters

Encoding ensures that categorical features are properly represented numerically, preserving their inherent characteristics while making them compatible with machine learning models. The choice of encoding depends on the type of data and the model requirements.

Last updated 5 months ago