Feature Selection vs Dimensionality Reduction

Feature Selection and Dimensionality Reduction (the latter is more commonly associated with unsupervised learning) are related but distinct concepts in machine learning. While both aim to reduce the number of features in a dataset, they differ in their approaches and goals:

Feature Selection:

  • Selects a subset of the original features that are most relevant to the problem.

  • Goal: Identify the most informative features that improve model performance.

  • Methods: Filter methods (e.g., correlation analysis), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO); a short sketch of all three families follows below.
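
To make the three families concrete, here is a minimal scikit-learn sketch. The synthetic dataset, the choice of k = 5, and the specific estimators are illustrative assumptions, not requirements:

```python
# A minimal sketch of the three feature-selection families.
# The dataset and hyperparameters below are illustrative, not prescriptive.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently (here, an ANOVA F-test) and keep the top k.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination repeatedly refits a model and drops
# the weakest features until only the requested number remain.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: an L1 (LASSO-style) penalty drives uninformative coefficients to zero,
# so selection happens as a side effect of training the model itself.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, "kept feature indices:", sel.get_support(indices=True))
```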

Dimensionality Reduction:

  • Transforms the original features into a new set of features that capture the most important information.

  • Goal: Reduce the number of features while preserving the underlying structure and relationships.

  • Methods: Linear methods (e.g., PCA), non-linear methods (e.g., t-SNE, autoencoders), and manifold learning methods (e.g., LLE); a short sketch follows below.
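
As a rough illustration of how these methods create new features rather than keep original ones, here is a sketch using scikit-learn; the digits dataset and the two output dimensions are assumptions chosen purely for brevity:

```python
# Contrast a linear method (PCA) with a non-linear one (t-SNE).
# Both produce new features (components/embeddings) rather than selecting original columns.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)          # 64-dimensional pixel features

# Linear: principal components are orthogonal linear combinations of the 64 pixels.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear: t-SNE preserves local neighborhood structure; useful for visualization,
# but its embedding cannot be applied directly to new, unseen samples.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X.shape, "->", X_pca.shape, "and", X_tsne.shape)
```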

Key differences:

  • Feature selection selects a subset of the original features, while dimensionality reduction creates new features.

  • Feature selection focuses on identifying the most informative features, while dimensionality reduction aims to preserve the underlying structure and relationships.

To illustrate the difference, let's consider a dataset with features like height, weight, and age. Feature selection might select only height and weight as the most informative features, while dimensionality reduction (e.g., PCA) might create a new feature that combines height and weight into a single feature, capturing the underlying correlation between them.
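A small sketch of that contrast, using an invented height/weight/age dataset (the numbers and the crude BMI-style label below are made up purely for illustration):

```python
# Feature selection keeps original columns; PCA builds a new combined feature.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, size=200)   # strongly correlated with height
age = rng.uniform(18, 80, size=200)                       # unrelated to the label below
X = np.column_stack([height, weight, age])
y = (weight / (height / 100) ** 2 > 22).astype(int)       # crude label driven by height/weight

# Feature selection: keeps two of the original columns (here, most likely height and weight).
kept = SelectKBest(f_classif, k=2).fit(X, y).get_support(indices=True)
print("selected original columns:", kept)

# Dimensionality reduction: the first principal component is a new feature that
# blends height and weight (their shared variance), with little weight on age.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(X_std)
print("component loadings (height, weight, age):", pca.components_[0].round(2))
```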

To give a real-life use case for supervised learning, suppose we're building a classification model to predict whether a customer will churn from a telecom company based on their usage patterns. Our dataset has 100 features, including:

  • Call minutes

  • Text messages sent

  • Data usage

  • Number of international calls

  • ...

  • Average call duration on Mondays

  • Average data usage on weekends

However, many of these features are correlated or redundant, making it difficult to train an effective model. We can apply a dimensionality reduction technique such as Principal Component Analysis (PCA) to reduce the number of features while preserving the most important information. After applying PCA, we might retain only the top 10 principal components, the new features that explain the most variance in the data. Each component is a weighted combination of the original features, so the leading components might roughly correspond to patterns such as:

  • Overall usage volume (call minutes, text messages, data usage)

  • International calling behavior (number and duration of international calls)

  • ...

  • Time-of-week usage patterns (e.g., weekday call duration vs. weekend data usage)

By reducing the dimensionality from 100 features to 10 components, we simplify the model, reduce overfitting, and speed up training, while still retaining the essential information for making accurate predictions.
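
A minimal sketch of that workflow, assuming a scikit-learn setup; the synthetic stand-in data, the choice of 10 components, and the logistic regression classifier are all illustrative assumptions rather than a prescription for real churn data:

```python
# Sketch of the churn workflow described above: scale, project onto the top
# principal components, then classify.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for 100 correlated usage features (call minutes, data usage, ...).
X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           n_redundant=60, random_state=42)

churn_model = make_pipeline(
    StandardScaler(),              # PCA is scale-sensitive, so standardize first
    PCA(n_components=10),          # keep the 10 components with the most variance
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(churn_model, X, y, cv=5)
print("cross-validated accuracy with 10 components:", scores.mean().round(3))
```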

In supervised learning, dimensionality reduction helps:

  • Reduce the risk of overfitting

  • Improve model interpretability

  • Speed up training and testing

  • Identify the most important features

Keep in mind that dimensionality reduction is not always necessary, and it's important to carefully evaluate its impact on model performance and interpretability. While feature selection and dimensionality reduction can be used together, they serve distinct purposes in machine learning applications.
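
One rough way to run that evaluation, sketched under the same assumptions as the pipeline above (synthetic stand-in data, 10 components), is to compare cross-validated scores with and without the reduction step:

```python
# Compare cross-validated accuracy with and without the PCA step before
# committing to dimensionality reduction. Data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=40, random_state=1)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced = make_pipeline(StandardScaler(), PCA(n_components=10),
                        LogisticRegression(max_iter=1000))

print("all 100 features :", cross_val_score(baseline, X, y, cv=5).mean().round(3))
print("10 PCA components:", cross_val_score(reduced, X, y, cv=5).mean().round(3))
# If the reduced pipeline matches the baseline, the simpler model is usually
# preferable; if it is clearly worse, the reduction is discarding useful signal.
```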
