k-Nearest Neighbors
k-Nearest Neighbors, often abbreviated as kNN, is a simple yet powerful machine learning algorithm used for both classification and regression tasks. It is based on the principle of similarity, which assumes that similar data points tend to have similar labels or values.
In kNN, when you want to classify a new data point, the algorithm looks at the k nearest data points in the training dataset, where k is a hyperparameter you can choose. These nearest neighbors are determined based on a similarity metric, often Euclidean distance, but other distance measures can be used as well.
The classification or prediction for the new data point is determined by a majority vote (for classification) or an average (for regression) of the labels or values of its k nearest neighbors. In essence, kNN makes predictions based on the majority class or the average value of its closest data points in the feature space.
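To make the voting idea concrete, here is a minimal from-scratch sketch in NumPy. The function name knn_predict and the toy data are made up for illustration; this is not how Scikit-Learn implements kNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small, well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

For regression, the vote would be replaced by the mean of the neighbors' values.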
Key points to remember about kNN:
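It's a non-parametric, lazy learning algorithm: non-parametric because it doesn't make strong assumptions about the data's underlying distribution, and lazy because it builds no explicit model during training and instead stores the training data and defers the computation to prediction time.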
The choice of the k hyperparameter can significantly impact the model's performance and must be carefully selected through validation techniques.
kNN is sensitive to the choice of distance metric and feature scaling, so preprocessing the data is crucial.
It's often used for tasks such as recommendation systems, image classification, and anomaly detection, among others.
In this tutorial, we'll explore how to implement kNN using Scikit-Learn. We will cover the essentials of kNN, including data import, exploration, preprocessing, model building, evaluation, and comparison with other classification methods like Logistic Regression, Support Vector Machines and Decision Trees. By the end of this tutorial, you will have a solid understanding of how kNN works and how to use it in real-world applications.
Here are the initial packages we will be working with. Note that I will import the necessary scikit-learn modules as needed.
1. Import Data
The first step in any machine learning project is to import the dataset. In this tutorial, we'll use the wine dataset, loaded with Scikit-Learn's load_wine function, which contains information about different wines.
Display metadata, the first 650 characters, and the target name, i.e. classes:
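A minimal sketch of this step could look as follows. The variable name wine is an assumption made for illustration.

```python
from sklearn.datasets import load_wine

# Load the dataset as a Bunch object (a dictionary-like container)
wine = load_wine()

# Dataset description (metadata); only the first 650 characters are printed
print(wine.DESCR[:650])

# Names of the target classes
print(wine.target_names)
```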
Convert the data into a pandas DataFrame:
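One way to do the conversion is sketched below. The column name target is an assumption; wines_df is the DataFrame name used throughout this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Features become columns; the class label is appended as an extra column
wines_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wines_df["target"] = wine.target

# First five rows
print(wines_df.head())
```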
Displaying the first 5 rows reveals that the features in the dataset have varying scales. For instance, magnesium levels range from 100 to 127, and flavanoids levels range from 0.61 to 0.76. This could lead to problems with distance-sensitive algorithms like K-Nearest Neighbors or Support Vector Machines, or with Gradient Descent-driven techniques such as Logistic Regression, because these methods are sensitive to the ranges of the features.
2. Exploratory Data Analysis
Before diving into modeling, it's crucial to understand your data. We'll explore the dataset to gain insights into its structure, attributes, and any potential challenges.
wines_df.info() yields that:
There are 178 data samples and 14 columns, including the target (the one we want to predict)
There are NO missing values in any of the columns
All features are of type float64, and the target is of type int64.
The dataset's memory usage is 19.6 KB
Once we have the generic info about the dataset, we can also explore the descriptive statistics about each feature in the dataset:
Pandas' describe() method shows the range of each feature in the dataset. As discussed above, these differing ranges need to be addressed by scaling the dataset before building our model.
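As a sketch, the descriptive statistics can be obtained like this. Loading with as_frame=True and transposing the result are choices made here for brevity and readability.

```python
from sklearn.datasets import load_wine

# as_frame=True returns a DataFrame that already includes the target column
wines_df = load_wine(as_frame=True).frame

# Transpose so that each feature is a row, which is easier to read
stats = wines_df.describe().T
print(stats[["min", "max", "mean", "std"]])
```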
Next, we will create a scatterplot matrix, also known as a pairs plot, using Pandas' scatter_matrix function. A scatterplot matrix is a grid of scatterplots that allows us to visualize the relationships between multiple variables in a dataset. It's particularly useful for understanding the pairwise relationships and correlations between numerical columns in a dataset.
Two important parameters used in the scatter plot are the color (c) and the transparency (alpha), which allow us to easily distinguish the target classes in the graphs. It is good practice to use them whenever we can in our analyses.
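A possible call is sketched below. To keep the grid small, only four columns are plotted; that subset, the figure size and the alpha value are arbitrary choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wines_df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Arbitrary subset of features, chosen only to keep the plot readable
cols = ["alcohol", "total_phenols", "flavanoids", "ash"]

axes = pd.plotting.scatter_matrix(
    wines_df[cols],
    c=wine.target,    # color each point by its class
    alpha=0.7,        # transparency, helps with overlapping points
    figsize=(8, 8),
    diagonal="hist",  # histograms on the diagonal (the default)
)
plt.close("all")
```

In an interactive session, plt.show() would be called instead of closing the figure.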
The pairwise plots help us easily assess the correlation between any two features. For instance, we can clearly see the strong correlation between the features total_phenols and flavanoids, and therefore it might be a good idea to use only one of them in our analysis. This procedure is called feature selection and is commonly used in machine learning applications (we will be keeping all the features in this tutorial). The scatterplot matrix also shows us the distribution of each feature on the diagonal. By default, it shows a histogram of each variable, but we can set it to display a kernel density estimate by adding diagonal='kde' to the function's parameters. From the histograms, we can say that ash and alcalinity_of_ash are the only two features that seem to be normally distributed, and most of the other features are right-skewed.
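The visual impression of a strong correlation can be double-checked numerically. The snippet below is a small sketch using the Pearson correlation coefficient, which is the default of the pandas corr method.

```python
from sklearn.datasets import load_wine

wines_df = load_wine(as_frame=True).frame

# Pearson correlation between the two features discussed above
corr = wines_df["total_phenols"].corr(wines_df["flavanoids"])
print(round(corr, 2))
```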
3. Preprocessing & Scaling
We've gained a good grasp of what our data looks like, and reaching this point typically signals that we're ready to begin preparing the data for use in a machine learning model.
Data preprocessing stands as a crucial stage in the machine learning workflow because real-world data is often quite messy. It might exhibit various issues, including
missing values,
redundant entries,
outliers,
errors,
and noise.
Addressing these concerns is essential before feeding the data to a machine learning model. Otherwise, the model could inadvertently learn from these issues and make mistakes when presented with new data – this is encapsulated in the famous adage, "Garbage in, garbage out."
Apart from our data having different scales, there don't appear to be any major issues upon initial inspection.
In machine learning, when it comes to scaling the dataset, it is generally recommended to split the dataset into a training set and a test (or validation) set before applying any scaling transformations. The reason for this is to prevent data leakage and ensure that your model generalizes well to unseen data. The following steps are typically followed:
Split the dataset
Scale and transform the training set
Transform the test set
To tackle the scaling problem, we will employ sklearn's StandardScaler class to standardize the features. This process will ensure that each feature has a mean of zero and a variance of 1.
3.1. Split the dataset
Divide your dataset into a training set and a test set (or a validation set). The training set is used to train your machine learning model, and the test set is used to evaluate its performance.
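A sketch of the split is shown below. The test_size of 0.3 and the random_state of 42 are assumptions made for this example, so the exact samples in each set (and any scores computed later) may differ from the ones reported in this tutorial.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples for testing (assumed ratio); fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```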
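In the split phase, we haven't specified the stratify argument, so it keeps its default value of None and the split is purely random: the ratio of labels in the train and test sets may differ from the ratio in the original dataset. Passing stratify=y instead keeps the same ratio of labels in both sets as in the original dataset. Try it and see whether it impacts your results!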
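The effect of stratify can be checked by comparing class proportions. This is a sketch; the test_size and random_state values are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split settings, without and with stratification
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=42)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fraction of each class in the full dataset and in the two test sets
full_ratio = np.bincount(y) / len(y)
plain_ratio = np.bincount(y_test_plain, minlength=3) / len(y_test_plain)
strat_ratio = np.bincount(y_test_strat, minlength=3) / len(y_test_strat)

print("full dataset:", full_ratio.round(3))
print("random split:", plain_ratio.round(3))
print("stratified  :", strat_ratio.round(3))
```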
3.2. Scale the training set
Apply the scaling transformations (e.g., mean-centering and standardization) to the features in your training set. This helps ensure that the data has a consistent scale and that your machine learning algorithms perform better. For this tutorial, we will be using StandardScaler which will ensure that each feature will have the mean value of 0 and the variance of 1, bringing all features to the same magnitude.
Note that I transposed the dataframe (swapped its rows and columns) for better visualization, i.e. to fit all features on the screen. The magnesium levels now range from -0.841477 to 2.294697 in the first five rows.
3.3. Transform the test set
Use the scaling parameters (such as the mean and standard deviation) calculated from the training set to transform the test set. Do not recompute the scaling parameters using the test set, as this can introduce data leakage.
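The two scaling steps can be sketched as follows: fit the scaler on the training set only, then reuse its learned mean and standard deviation for the test set. The split settings are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()

# Learn mean and standard deviation from the training set and scale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test set; note transform, not fit_transform
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 for every feature
print(X_test_scaled.mean(axis=0).round(3))   # close to, but not exactly, 0
```

The test set means are not exactly zero, which is expected: the test data was scaled with statistics it did not contribute to.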
We are now ready to build our model.
4. Building the Model
Now, we'll build our kNN model. We will first use a value of 3 for 'k', with the intention of refining it later. To do so, we first instantiate the kNN model and then train it on our training data, providing both the features and the target variable so that the model can learn from labeled examples, hence the name 'supervised learning'.
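Instantiating and training the model can be sketched like this. The split settings are assumptions; all other KNeighborsClassifier parameters keep their defaults.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Instantiate the model with k = 3, then train it on features and labels
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# Inspect the settings the model is using, including the distance metric
params = knn.get_params()
print(params["n_neighbors"], params["metric"], params["p"])
```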
Minkowski Distance:
A generalization of both Euclidean and Manhattan distances.
It includes a parameter (p) that allows you to choose the distance metric. When p=2, it is equivalent to Euclidean distance, and when p=1, it is equivalent to Manhattan distance.
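The relationship between p and the resulting distance can be illustrated with a small NumPy sketch. The helper minkowski_distance and the two points are made up for this example.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```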
You can check the kNN Parameters page for more information on all parameters used in the algorithm.
5. Make Prediction & Calculate Accuracy
With our model in place, we'll use it to make predictions and evaluate its accuracy. We'll also discuss the concept of accuracy as an evaluation metric.
The accuracy on the test set (unseen data) is about 0.963. This means that with k = 3 our model is about 96.3% accurate, i.e. it predicted the class correctly for 96.3% of the samples in the unseen test set.
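A score of this kind can be computed along the following lines. Because the test_size and random_state used here are assumptions, the printed value may not match the figure quoted above exactly.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# Predict classes for the unseen test samples
y_pred = knn.predict(X_test_scaled)

# Accuracy = share of test samples whose predicted class equals the true class
accuracy = accuracy_score(y_test, y_pred)
print(round(accuracy, 3))
```

The shortcut knn.score(X_test_scaled, y_test) returns the same number.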
6. Finding the Best Value for k (Hyperparameter Tuning)
k, the number of nearest neighbors, is a crucial hyperparameter in kNN. We'll explore methods for selecting the best value for k:
6.1. Selecting multiple k values
We'll try different values of k and evaluate their impact on model performance.
We have tried 15 k values, ranging from 1 to 15, and the best accuracy we obtained is 0.981, reached by six different k values: 7, 8, 9, 12, 14, and 15. By visually inspecting the plot or printing the accuracies_k dictionary, we can easily select the k value of 7, which is the first k value with the maximum accuracy. However, if we were to try tens or hundreds (or more) of k values, this would become challenging. The code below helps us find the k value with the highest accuracy in two different ways:
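Here is a sketch of the loop together with two ways of picking the best k from the accuracies_k dictionary. The split settings are assumptions, so the accuracies may differ from the ones quoted above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test accuracy for every k from 1 to 15
accuracies_k = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    accuracies_k[k] = knn.score(X_test_scaled, y_test)

# Way 1: max() over the keys, ranked by accuracy (returns the first maximum)
best_k = max(accuracies_k, key=accuracies_k.get)

# Way 2: collect every k that ties for the best accuracy
best_accuracy = max(accuracies_k.values())
tied_ks = [k for k, acc in accuracies_k.items() if acc == best_accuracy]

print(best_k, round(best_accuracy, 3), tied_ks)
```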
Another accuracy we should consider is the training set accuracy, which will help us better understand the model's ability to generalize. The code below shows both the training and test set accuracies for different k values.
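A sketch of this comparison is given below. It prints the numbers rather than plotting them, and the split settings are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_accuracies = {}
test_accuracies = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    # Accuracy on data the model has seen vs. data it has not seen
    train_accuracies[k] = knn.score(X_train_scaled, y_train)
    test_accuracies[k] = knn.score(X_test_scaled, y_test)

for k in range(1, 16):
    print(k, round(train_accuracies[k], 3), round(test_accuracies[k], 3))
```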
The plot shows us that when k=1, using only one neighbor, the prediction on the training set is perfect. As more neighbors are added, the model becomes simpler: the training accuracy drops, but the test accuracy increases. However, with 10 neighbors, the model becomes too simple and performance decreases. The best performance occurs when k = 7. Keep in mind that even the worst performance is more than 94% accuracy, which is often acceptable for many applications.
6.2. Using Cross Validation
Cross-validation is a statistical method of evaluating a model's generalization performance, and it is more stable and complete than simply splitting data into training and test sets. In the code snippet below, we take a range of k values and set up an empty list to store the outcomes. We employ cross-validation to calculate accuracy scores, eliminating the need for creating a training and test split. Nevertheless, we must ensure our data is properly scaled. We then iterate through the 'k' values and append the corresponding scores to our list.
For implementing cross-validation, we make use of scikit-learn's cross_val_score function. We provide an instance of the kNN model, our dataset, and specify the number of splits to perform. In the code below, we opt for five splits, which means the data is divided into five equal-sized groups, with four groups used for training and one for testing in each iteration. The model cycles through each group, generating an accuracy score for each, which we subsequently average to determine the best model.
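A sketch of this loop follows. One deliberate choice differs from scaling the whole dataset up front: the scaler is wrapped in a pipeline (make_pipeline) so that it is refit on the training folds only, which is an assumption of this example rather than necessarily the original code.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

k_values = list(range(1, 16))
scores = []

for k in k_values:
    # Scaling happens inside each fold, so test folds never influence the scaler
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Five folds: one accuracy score per fold
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores.append(fold_scores.mean())

best_k = k_values[int(np.argmax(scores))]
print(best_k, round(max(scores), 3))
```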
The highest accuracy we can obtain is 0.967, which can be attained by using 7 or 8 nearest neighbors. Therefore, we will take 7 as our best value for k, and can conclude that with k=7 our model is expected to be around 97% accurate on average.
6.2.1. Grid Search Cross Validation - GridSearchCV
We can also combine both methods using sklearn.model_selection.GridSearchCV, which yields cleaner code:
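A possible GridSearchCV setup is sketched below. The pipeline step names scaler and knn and the range of k values are choices made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and kNN in one estimator, so scaling is redone inside every fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"knn__n_neighbors": list(range(1, 16))}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```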
7. Other Evaluation Metrics
7.1. Accuracy vs Precision vs Recall
Accuracy is not the only evaluation metric. We'll also need to check the precision and recall metrics to understand how well our model identifies each class in unseen data.
Precision: Precision is the ratio of correctly predicted positive instances to the total instances predicted as positive. In other words, it measures the accuracy of the positive predictions made by the model. A higher precision indicates fewer false positives.
Precision = TP / (TP + FP)
Where:
TP (True Positives) is the number of instances correctly predicted as positive.
FP (False Positives) is the number of instances incorrectly predicted as positive (i.e., negative instances that were predicted as positive)
Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive instances to the total actual positive instances in the dataset. It measures the model's ability to find all positive instances. A higher recall indicates fewer false negatives.
Recall = TP / (TP + FN)
Where:
TP (True Positives) is the number of instances correctly predicted as positive.
FN (False Negatives) is the number of instances incorrectly predicted as negative (i.e., positive instances that were predicted as negative).
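As a quick worked example of the two formulas, consider the following sketch. The counts (8 true positives, 2 false positives, 4 false negatives) are hypothetical and not taken from the wine dataset.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only
tp, fp, fn = 8, 2, 4

p = precision(tp, fp)  # 8 / (8 + 2)
r = recall(tp, fn)     # 8 / (8 + 4)
print(round(p, 3), round(r, 3))
```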
Let's re-build our model with a k value of 7, and calculate the scores.
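A sketch of this step is shown below. The split settings are assumptions, and average=None is used so that precision and recall are reported per class.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Rebuild the model with k = 7
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
# average=None returns one value per class instead of a single aggregate
precisions = precision_score(y_test, y_pred, average=None)
recalls = recall_score(y_test, y_pred, average=None)

print("accuracy :", round(accuracy, 3))
print("precision:", precisions.round(3))
print("recall   :", recalls.round(3))
```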
7.2. Confusion Matrix
A confusion matrix is a fundamental tool for evaluating the performance of classification models and provides a concise summary of how well a model's predictions match the actual outcomes, making it an essential component of model evaluation.
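The matrix can be computed with scikit-learn's confusion_matrix function, as in the sketch below. The split settings are assumptions, so the individual cell counts may differ from the ones discussed in this tutorial.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal
n_correct = int(np.trace(cm))
print(n_correct, "of", cm.sum(), "test samples classified correctly")
```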
According to our confusion matrix, our kNN model misclassified only one data point: a sample that belongs to class 1 was predicted as class 0. Precision and recall values per class are as follows:
7.3. Classification Reports
A classification report provides several important metrics for evaluating the performance of a classification model. These metrics include:
Precision: measures the accuracy of positive predictions.
Recall: measures the model's ability to find all positive instances.
F1-Score: balances precision and recall.
Support: indicates the number of instances in each class.
Accuracy: measures the overall correctness of predictions.
Macro Avg: calculates unweighted averages of class-specific metrics and treats all classes equally.
Weighted Avg: calculates class-specific metrics with a weighted average based on the number of instances in each class.
We generate classification reports using scikit-learn's classification_report function:
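A minimal sketch of generating the report for the k = 7 model is shown below. The split settings are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# target_names replaces the numeric labels 0, 1, 2 with readable class names
report = classification_report(y_test, y_pred, target_names=wine.target_names)
print(report)
```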
The classification report summarizes our earlier findings in terms of overall accuracy and the precision and recall values per class.
8. Classification with Other Algorithms for Comparison
To provide a holistic view of classification, we'll compare kNN with other classification methods:
Logistic Regression
Support Vector Machines
Decision Trees
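Training the four models on the same scaled data can be sketched as follows. The hyperparameters (max_iter=1000 for Logistic Regression, default SVC settings, random_state=42 for the tree) and the split settings are assumptions of this example.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=7),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit every model on the same training data and score it on the same test data
accuracies = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracies[name] = model.score(X_test_scaled, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```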
8.1. Evaluations using Classification Reports
Time to compare models' performances using classification reports of kNN, Logistic Regression, Support Vector Machines, and Decision Tree Classifier. This will allow us to compare their performance and understand the pros and cons of each method.
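One way to produce the reports side by side is sketched below. As before, the hyperparameters and split settings are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=7),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# One classification report per model, all on the same test set
reports = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    reports[name] = classification_report(y_test, y_pred)
    print(name)
    print(reports[name])
```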
Note: For more detailed information on each metric, you can check the Classification Report page.
From the output, it appears that kNN, Logistic Regression, and Support Vector Machines models perform equally well, giving us the flexibility to select any of them. However, it's essential to have a deeper understanding of these models and the knowledge they acquire during the learning process. This deeper insight will provide us with a clearer understanding of their respective strengths and weaknesses. Having this knowledge is immensely valuable to stakeholders, as it empowers them to devise solutions to address areas where the model may have limitations.
Conclusion
With the inspection of the classification report, we have concluded our first kNN classification task. In this tutorial, we used one of Scikit-Learn's datasets, the wine dataset, explored how to use Scikit-Learn to preprocess datasets (scaling in particular), implemented the kNN algorithm to classify the dataset, and learned how to fine-tune kNN's hyperparameter k for optimal performance.