Feature Selection vs Dimensionality Reduction

Feature Selection and Dimensionality Reduction (the latter sees wider use in unsupervised learning) are related but distinct concepts in machine learning. Both aim to reduce the number of features in a dataset, but they differ in their approaches and goals:

Feature Selection:

  • Selects a subset of the original features that are most relevant to the problem.

  • Goal: Identify the most informative features that improve model performance.

  • Methods: Filter methods (e.g., correlation analysis), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO), as sketched in the example below.
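
For concreteness, here is a minimal scikit-learn sketch of all three families on a synthetic dataset; the specific choices (ANOVA F-test, k=5, alpha=0.05) are illustrative assumptions, not recommendations:

```python
# A sketch of the three feature-selection families with scikit-learn.
# The dataset is synthetic and the parameter values (k=5, alpha=0.05)
# are illustrative assumptions, not recommendations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import Lasso, LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Filter method: score each feature independently (here, an ANOVA F-test).
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Filter picks:  ", np.flatnonzero(filter_sel.get_support()))

# Wrapper method: recursive feature elimination around an estimator.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("Wrapper picks: ", np.flatnonzero(wrapper_sel.get_support()))

# Embedded method: LASSO shrinks coefficients of uninformative features to zero.
embedded_sel = SelectFromModel(Lasso(alpha=0.05)).fit(X, y)
print("Embedded picks:", np.flatnonzero(embedded_sel.get_support()))
```

Note that each selector keeps columns of the original matrix: `filter_sel.transform(X)` returns a subset of X's columns, not new ones, which is the defining trait of feature selection.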

Dimensionality Reduction:

  • Transforms the original features into a new set of features that capture the most important information.

  • Goal: Reduce the number of features while preserving the underlying structure and relationships.

  • Methods: Linear methods (e.g., PCA, LDA), non-linear methods (e.g., t-SNE, autoencoders), and manifold learning methods (e.g., LLE, Isomap).

Key differences:

  • Feature selection selects a subset of the original features, while dimensionality reduction creates new features.

  • Feature selection focuses on identifying the most informative features, while dimensionality reduction aims to preserve the underlying structure and relationships.

To illustrate the difference, let's consider a dataset with features like height, weight, and age. Feature selection might select only height and weight as the most informative features, while dimensionality reduction (e.g., PCA) might create a new feature that combines height and weight into a single feature, capturing the underlying correlation between them.
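
To make that concrete, here is a toy sketch in Python; the numbers (and the strength of the height-weight correlation) are fabricated purely for illustration:

```python
# Toy illustration of the height/weight example: PCA folds the two
# correlated measurements into one new feature. All numbers are made up.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=200)               # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg, tracks height
age = rng.uniform(18, 65, size=200)                  # largely independent

X = np.column_stack([height, weight, age])
X_scaled = StandardScaler().fit_transform(X)         # PCA is scale-sensitive

pca = PCA(n_components=1)
combined = pca.fit_transform(X_scaled)               # one new combined feature

print("Loadings (height, weight, age):", pca.components_[0])
print("Variance explained:", pca.explained_variance_ratio_[0])
```

The first component loads heavily on height and weight and only weakly on age, i.e., the new feature captures exactly the correlation described above.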

To give a real life use-case for supervised learning, suppose we're building a classification model to predict whether a customer will churn from a telecom company based on their usage patterns. Our dataset has 100 features, including:

  • Call minutes

  • Text messages sent

  • Data usage

  • Number of international calls

  • ...

  • Average call duration on Mondays

  • Average data usage on weekends

However, many of these features are correlated or redundant, making it difficult to train an effective model. We can apply dimensionality reduction techniques, such as Principal Component Analysis (PCA), to reduce the number of features while preserving the most important information. After applying PCA, we might keep only the top 10 principal components, the ones that explain the most variance in the data. Note that these components are not a subset of the original columns: each is a weighted combination of the originals, so a single component might summarize, say, call minutes, data usage, and average call duration at once.

By reducing the dimensionality from 100 features to 10, we simplify the model, reduce the risk of overfitting, and speed up training, while still retaining the information needed for accurate predictions.
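
A hedged sketch of this workflow follows; the dataset is synthetic (standing in for the 100-feature usage table), and the choice of 10 components and a logistic-regression classifier are assumptions for illustration:

```python
# Sketch of PCA inside a supervised churn-style pipeline. The data is a
# synthetic stand-in for the 100-feature usage table described above.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 100 features, only 10 truly informative, many redundant (correlated) ones.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           n_redundant=60, random_state=0)

model = make_pipeline(
    StandardScaler(),        # standardize first: PCA is scale-sensitive
    PCA(n_components=10),    # keep the 10 highest-variance components
    LogisticRegression(max_iter=1000),
)

print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```

Putting PCA inside the pipeline also ensures the components are fit only on each training fold, avoiding leakage into the validation folds.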

In supervised learning, dimensionality reduction helps:

  • Reduce the risk of overfitting

  • Improve model interpretability

  • Speed up training and testing

  • Identify the most important features

Keep in mind that dimensionality reduction is not always necessary, and it's important to carefully evaluate its impact on model performance and interpretability. While feature selection and dimensionality reduction can be used together, they serve distinct purposes in a machine learning pipeline.
