Encoding
Encoding is the process of transforming categorical data into a numerical format that machine learning algorithms can interpret. Most algorithms work with numerical inputs, so encoding is essential when working with categorical variables. There are two common methods for encoding:
1. Label Encoding:
Converts categories into unique integers: each unique category is assigned an integer in the range [0, n_categories - 1].
Example: ["Red", "Green", "Blue"] → [0, 1, 2]
Suitable for ordinal data (e.g., "Low", "Medium", "High").
Simple and memory-efficient.
May introduce unintended ordinal relationships in nominal (non-ordinal) categorical data, which can hurt algorithms that treat the integers as magnitudes or distances, such as linear regression or K-means.
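The mapping described above can be sketched in plain Python. The function name `label_encode` and its `categories` parameter are hypothetical, chosen for illustration only. When no category order is given, this sketch assigns integers in order of first appearance, which is an assumption; libraries may choose differently (scikit-learn's `LabelEncoder`, for instance, sorts the classes).

```python
def label_encode(values, categories=None):
    """Map each category to an integer in [0, n_categories - 1].

    Hypothetical helper for illustration, not a library API.
    """
    if categories is None:
        # Assumption: order of first appearance defines the integer codes.
        categories = list(dict.fromkeys(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Nominal data: the integers are arbitrary identifiers.
colors = ["Red", "Green", "Blue", "Green"]
codes, mapping = label_encode(colors)
print(codes)    # [0, 1, 2, 1]
print(mapping)  # {'Red': 0, 'Green': 1, 'Blue': 2}

# Ordinal data: pass the order explicitly so the integers reflect it.
sizes = ["Low", "High", "Medium", "Low"]
size_codes, size_mapping = label_encode(sizes, categories=["Low", "Medium", "High"])
print(size_codes)  # [0, 2, 1, 0]
```

Passing the category order explicitly is what makes the encoding meaningful for ordinal features: "Low" < "Medium" < "High" becomes 0 < 1 < 2.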
2. One-Hot Encoding:
Converts each unique category into a separate binary column (also known as a "dummy variable"), indicating presence (1) or absence (0).
Example: ["Red", "Green", "Blue"] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Suitable for nominal data, i.e. data with no inherent order (e.g., "Red", "Blue", "Green").
Prevents introducing ordinal relationships into non-ordinal data.
Good for algorithms that expect numeric input but don't assume any ordinal relationship (e.g., linear regression, neural networks, clustering).
Works well with many machine learning models.
Increases dimensionality significantly when the categorical variable has many unique values.
Can lead to a sparse dataset.
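A minimal sketch of one-hot encoding in plain Python follows. The function name `one_hot_encode` is hypothetical, and the column order (order of first appearance) is an assumption made for this example; a real project would more likely rely on a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(values, categories=None):
    """Return one binary row per value, with one column per category.

    Hypothetical helper for illustration, not a library API.
    """
    if categories is None:
        # Assumption: columns follow the order of first appearance.
        categories = list(dict.fromkeys(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)  # absence everywhere...
        row[index[value]] = 1        # ...except the matching column
        rows.append(row)
    return rows, categories


colors = ["Red", "Green", "Blue"]
encoded, columns = one_hot_encode(colors)
print(columns)  # ['Red', 'Green', 'Blue']
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row has exactly one 1, so the number of columns grows with the number of unique categories and most entries are 0. That is the source of the dimensionality and sparsity costs listed above.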
3. Key Differences
| Feature | Label Encoding | One-Hot Encoding |
| --- | --- | --- |
| Output | Single integer column | Multiple binary columns (one per category) |
| Type of Data | Ordinal or nominal | Nominal only |
| Ordinal Relationship | May impose unintended order | No ordinal relationship implied |
| Dimensionality | Low (1 column per feature) | High (one column for each category) |
| Use Cases | Decision trees, ordinal features | Linear regression, neural networks |
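To make the "Ordinal Relationship" row concrete, the sketch below compares pairwise Euclidean distances between three colors under each encoding. The specific integer codes (Red = 0, Green = 1, Blue = 2) and the helper names are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(encoding):
    """Distance between every pair of categories in an encoding."""
    return {
        (a, b): euclidean(encoding[a], encoding[b])
        for a, b in combinations(encoding, 2)
    }


# Assumed codes for illustration only.
label_codes = {"Red": [0], "Green": [1], "Blue": [2]}
one_hot_codes = {"Red": [1, 0, 0], "Green": [0, 1, 0], "Blue": [0, 0, 1]}

label_dist = pairwise_distances(label_codes)
one_hot_dist = pairwise_distances(one_hot_codes)

for pair in label_dist:
    print(pair, label_dist[pair], round(one_hot_dist[pair], 3))
```

Under label encoding, Red ends up twice as far from Blue as from Green, an ordering that exists only because of the arbitrary integer assignment. Under one-hot encoding, every pair of distinct categories is the same distance apart (the square root of 2), so a distance-based method such as K-means sees no implied order.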
4. Why Encoding Matters
Encoding ensures that categorical features are properly represented numerically, preserving their inherent characteristics while making them compatible with machine learning models. The choice of encoding depends on the type of data and the model requirements.