Supervised Learning
Supervised learning is a type of machine learning in which an algorithm learns a mapping from input data to output labels using a labeled training dataset. The model is trained to make predictions or classifications from input features while having access to the correct answers during training. The goal is a mapping function that generalizes well enough to make accurate predictions on new, unseen data.
There are two primary types of supervised learning:
Regression:
Regression involves predicting a continuous numerical value or quantity based on input features. In regression, the output is a real number rather than a discrete category. The model learns to approximate the relationship between input variables and a target numerical value.
Examples of regression tasks include:
Predicting housing prices based on features like square footage, number of bedrooms, and location.
Forecasting stock prices based on historical market data.
Estimating the age of a person based on demographic information.
Regression algorithms include linear regression, polynomial regression, support vector regression, and regularized variants of linear regression such as ridge and lasso.
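As a minimal sketch of a regression task, the example below fits scikit-learn's LinearRegression to predict a price from square footage. The data is synthetic and generated only for illustration; the slope, intercept, and noise level are assumptions, not real housing figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration):
# price = 150 * square_feet + 20000 + random noise
rng = np.random.default_rng(0)
square_feet = rng.uniform(500, 3000, size=200)
price = 150.0 * square_feet + 20_000.0 + rng.normal(0, 5_000, size=200)

X = square_feet.reshape(-1, 1)  # features: 2D, shape (n_samples, n_features)
y = price                       # target: 1D, continuous values

model = LinearRegression()
model.fit(X, y)

# Predict a continuous value for a new, unseen input
predicted = model.predict(np.array([[1500.0]]))
print(f"Estimated slope: {model.coef_[0]:.1f}")
print(f"Predicted price for 1500 sq ft: {predicted[0]:.0f}")
```

The learned slope should land close to the value used to generate the data, which is the sense in which the model approximates the relationship between input and target.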
Classification:
In classification, the goal is to categorize input data into discrete classes or categories. Each data point is associated with a specific class label. The model's objective is to learn the decision boundaries that separate different classes in the feature space.
Examples of classification tasks include:
Email spam detection (classifying emails as spam or not spam).
Image classification (e.g., classifying images of animals into different species).
Sentiment analysis (classifying text as positive, negative, or neutral).
Common algorithms for classification include logistic regression, decision trees, random forests, support vector machines, and deep learning techniques like neural networks.
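A comparable sketch for classification is shown below, using logistic regression on a synthetic two-class dataset from make_classification. The dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class dataset (parameters chosen only for illustration)
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)

clf = LogisticRegression()
clf.fit(X, y)

# predict returns discrete class labels; predict_proba returns one
# probability per class for each sample
labels = clf.predict(X[:5])
probabilities = clf.predict_proba(X[:5])
train_accuracy = clf.score(X, y)

print("Predicted labels:", labels)
print("Class probabilities:", probabilities.round(3))
print("Training accuracy:", round(train_accuracy, 3))
```

Unlike the regression case, the output here is a class label (0 or 1), and the per-class probabilities in each row sum to 1.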
Some major supervised learning algorithms, covering both regression and classification, are:
Linear Regression (regularized variants: Ridge, Lasso)
Multiple Linear Regression
Polynomial Regression
K-Nearest Neighbors (K-NN) (can also be used for regression)
Logistic Regression
Decision Trees
Random Forests (can also be used for regression)
Gradient Boosting Machines (GBM) (can also be used for regression)
Support Vector Machines (SVM)
Naive Bayes (variants: Gaussian, Bernoulli, Multinomial)
Neural Networks (Deep Learning), including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)
Linear Discriminant Analysis (LDA)
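Several of the algorithms listed above are noted as usable for both classification and regression. As a small illustration, scikit-learn provides K-NN in both forms, KNeighborsClassifier and KNeighborsRegressor. The six-point dataset below is made up for this sketch.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Tiny made-up dataset: two well-separated groups on one feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                  # discrete labels
y_value = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])   # continuous targets

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

query = np.array([[11.0]])

# Classifier: majority vote among the 3 nearest neighbors (10, 11, 12)
predicted_class = knn_clf.predict(query)[0]
# Regressor: mean target of the same 3 neighbors
predicted_value = knn_reg.predict(query)[0]

print(predicted_class)  # 1
print(predicted_value)  # 11.5
```

The same neighbors are found in both cases; only the way their targets are combined differs (a vote for classification, an average for regression).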
The key steps in supervised learning include:
Data Collection: Gathering a labeled dataset consisting of input features and corresponding output labels.
Data Preprocessing: Preparing and cleaning the data, which may involve tasks like feature scaling, handling missing values, and encoding categorical variables.
We will be using one of the most popular machine learning packages, scikit-learn. It is known for its easy-to-use interface and its wide range of functions and methods for building and training machine learning models.
scikit-learn estimators generally expect data in the following form:
No missing values (most estimators raise an error on NaN)
All numeric values (1 or 0 instead of Yes/No or True/False), i.e. no raw categorical data
Features formatted as a 2D array:
either a pandas DataFrame or a NumPy 2D array
Target formatted as a 1D array:
y = data['target'].values
y = np.ravel(y)  # flatten to 1D if needed
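The data requirements above can be sketched with pandas as follows. The DataFrame, its column names, and its values are hypothetical and exist only to show each requirement being met; filling with the median is one simple option among several for handling missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are assumptions
data = pd.DataFrame({
    "square_feet": [850.0, 1200.0, np.nan, 2000.0],
    "has_garage": ["Yes", "No", "Yes", "No"],
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "target": [200_000, 310_000, 250_000, 420_000],
})

# 1. No missing values: fill the gap with the column median
data["square_feet"] = data["square_feet"].fillna(data["square_feet"].median())

# 2. All numeric: map Yes/No to 1/0 and one-hot encode the categorical column
data["has_garage"] = data["has_garage"].map({"Yes": 1, "No": 0})
data = pd.get_dummies(data, columns=["city"], dtype=int)

# 3. Features as a 2D structure, target as a 1D array
X = data.drop(columns=["target"])
y = np.ravel(data["target"].values)

print(X.shape, y.shape)  # (4, 5) (4,)
```

In a real project, statistics such as the median are usually computed on the training split only, so that no information from the test data leaks into preprocessing.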
Model Selection: Choosing an appropriate supervised learning algorithm based on the nature of the problem and the dataset.
Training: Using the labeled training data to train the selected model. During training, the model adjusts its parameters to minimize the prediction error.
Evaluation: Assessing the model's performance using evaluation metrics such as accuracy, precision, recall, mean squared error, or others, depending on the task.
Testing: Testing the trained model on unseen data (the testing dataset) to measure its ability to generalize to new examples.
Deployment: Deploying the trained model in real-world applications to make predictions or classifications.
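The steps above, up to but not including deployment, can be sketched end to end with scikit-learn. This is one possible arrangement under stated assumptions: the Iris dataset bundled with the library stands in for data collection, logistic regression is an arbitrary model choice, and the split ratio and fold count are illustrative values.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: a small labeled dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold out a test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing + model selection: feature scaling followed by a classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Evaluation: 5-fold cross-validation on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("Cross-validation accuracy:", cv_scores.mean().round(3))

# Training: fit the pipeline on the full training set
model.fit(X_train, y_train)

# Testing: measure generalization on the held-out data
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", round(test_accuracy, 3))
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training folds only, which keeps the evaluation and test scores honest.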
Supervised learning is widely used in various domains, including natural language processing, computer vision, healthcare, finance, and more, where there is a need to make predictions or decisions based on historical data and labeled examples.