k-Nearest Neighbors

k-Nearest Neighbors, often abbreviated as kNN, is a simple yet powerful machine learning algorithm used for both classification and regression tasks. It is based on the principle of similarity, which assumes that similar data points tend to have similar labels or values.

In kNN, when you want to classify a new data point, the algorithm looks at the k nearest data points in the training dataset, where k is a hyperparameter you can choose. These nearest neighbors are determined based on a similarity metric, often Euclidean distance, but other distance measures can be used as well.

The classification or prediction for the new data point is determined by a majority vote (for classification) or an average (for regression) of the labels or values of its k nearest neighbors. In essence, kNN makes predictions based on the majority class or the average value of its closest data points in the feature space.
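To make the voting and averaging steps concrete, here is a minimal from-scratch sketch using only the Python standard library. The function name knn_predict and the toy data are hypothetical and chosen for illustration; this is not how Scikit-Learn implements kNN internally.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Sort training points by Euclidean distance to the query, keep the k closest
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    neighbor_targets = [target for _, target in nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: average of the neighbors' values
    return sum(neighbor_targets) / len(neighbor_targets)

# Hypothetical toy data: two clusters in a 2-D feature space
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["a", "a", "a", "b", "b"]
train_values = [10.0, 12.0, 11.0, 50.0, 52.0]

label = knn_predict(train_X, train_labels, (1.1, 1.0), k=3)
value = knn_predict(train_X, train_values, (1.1, 1.0), k=3, task="regression")
print(label, value)
```

The query point sits inside the first cluster, so all three of its neighbors come from there: the vote returns that cluster's label and the regression returns the mean of those three values.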

Key points to remember about kNN:

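It's a non-parametric, lazy learning algorithm: it makes no strong assumptions about the data's underlying distribution, and instead of learning an explicit model it stores the training data and defers the computation to prediction time.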
The choice of the k hyperparameter can significantly impact the model's performance and must be carefully selected through validation techniques.
kNN is sensitive to the choice of distance metric and feature scaling, so preprocessing the data is crucial.
It's often used for tasks such as recommendation systems, image classification, and anomaly detection, among others.

In this tutorial, we'll explore how to implement kNN using Scikit-Learn. We will explore the essentials of kNN, including data import, exploration, preprocessing, model building, evaluation, and comparison with other classification methods like Logistic Regression, Support Vector Machines and Decision Trees. By the end of this tutorial, you will have a solid understanding of how kNN works and how to use it in real-world applications.

Here are the initial packages we will be working with. The necessary Scikit-Learn imports are added as they are needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Import Data

The first step in any machine learning project is to import the dataset. In this tutorial, we'll use the wine dataset, loaded with Scikit-Learn's load_wine function, which contains information about different wines.

from sklearn.datasets import load_wine 

# Instantiate data object, which returns dictionary-like object
wines = load_wine()

# Display keys
wines.keys()

# Output:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

Display the first 650 characters of the metadata, and then the target names (i.e. the classes):

print(wines['DESCR'][:650] + '\n...')

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
          
...

wines.target_names

# Output
array(['class_0', 'class_1', 'class_2'], dtype='<U7')

Convert data into a pandas Dataframe

# Convert data to pandas dataframe
wines_df = pd.DataFrame(data=wines['data'], columns=wines['feature_names'])

# Add the target label
wines_df["target"] = wines['target']

# Display first 5 rows
wines_df.head()

Displaying the first 5 rows reveals that the features in the dataset have very different scales: for instance, magnesium values in these rows range from 100 to 127, whereas flavanoids values are in the low single digits. This can cause problems for distance-based algorithms like k-Nearest Neighbors or Support Vector Machines, and for Gradient Descent-driven techniques such as Logistic Regression, because these methods are sensitive to the ranges of the features.
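To see why differing scales matter for a distance-based method, consider a small sketch with two hypothetical wines described only by magnesium and flavanoids. The values are made up for illustration, but are on the same order of magnitude as the dataset's.

```python
import math

# Hypothetical (magnesium, flavanoids) values for two wines
wine_a = (100.0, 2.7)
wine_b = (127.0, 3.5)

magnesium_gap = wine_a[0] - wine_b[0]
flavanoids_gap = wine_a[1] - wine_b[1]

# Unscaled Euclidean distance between the two wines
distance = math.sqrt(magnesium_gap**2 + flavanoids_gap**2)

# Fraction of the squared distance contributed by magnesium alone
magnesium_share = magnesium_gap**2 / distance**2
print(f"distance = {distance:.3f}, magnesium share = {magnesium_share:.4f}")
```

Almost all of the distance comes from magnesium simply because its numbers are larger, so without scaling the flavanoids feature would barely influence which neighbors are chosen.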

2. Exploratory Data Analysis

Before diving into modeling, it's crucial to understand your data. We'll explore the dataset to gain insights into its structure, attributes, and any potential challenges.

wines_df.info()

wines_df.info() shows that:

There are 178 data samples and 14 columns, including the target (the column we want to predict).
There are NO missing values in any of the columns.
All features are of type float64, and the target is of type int64.
The dataset's memory usage is 19.6 KB

Once we have the generic info about the dataset, we can also explore the descriptive statistics about each feature in the dataset:

# Get descriptive statistics about each feature in the dataset
wines_df.describe().T

Pandas' describe() method shows the range of each feature in the dataset. It confirms the differences in scale discussed above, which we need to address by scaling the dataset before building our model.

Next we will create a scatterplot matrix, also known as a pairs plot, using Pandas' scatter_matrix function. A scatterplot matrix is a grid of scatterplots that allows us to visualize the relationships between multiple variables in a dataset. It's particularly useful for understanding the pairwise relationships and correlations between numerical columns in a dataset.

# create scatterplot matrix
ax = pd.plotting.scatter_matrix(wines_df.iloc[:,:-1], 
                                c = wines['target'],  #color
                                figsize=(15,15), 
                                marker='*',
                                alpha=0.5
                               )

# Update axis labels
for i in ax.flatten():
    # update rotation of labels
    i.xaxis.label.set_rotation(90)
    i.yaxis.label.set_rotation(0)

    # align y axis label to right
    i.yaxis.label.set_ha('right')

    # update size of labels
    i.yaxis.label.set_fontsize(12)
    i.xaxis.label.set_fontsize(12)

# plot the graph
plt.show()

Two important parameters used in the scatter plot are the color (c) and the transparency (alpha), which allow us to easily distinguish the target classes in the graphs. It is good practice to use them whenever we can in our analyses.

The pairwise plots help us judge the correlation between any two features. For instance, we can clearly see the strong correlation between the features total_phenols and flavanoids, so it might be a good idea to use only one of them in our analysis. This procedure is called feature selection and is commonly used in machine learning applications (we will keep all the features in this tutorial). The scatterplot matrix also shows the distribution of each feature on the diagonal. By default it shows a histogram of each variable, but we can display a kernel density estimate instead by passing diagonal='kde' to the function. From the histograms, ash and alcalinity_of_ash are the only two features that appear roughly normally distributed, and most of the other features are right-skewed.
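The visual impression of correlation can be checked numerically. The sketch below assumes a Scikit-Learn version whose load_wine accepts as_frame=True, and uses pandas' corr() to compute the Pearson correlation; the variable names are illustrative.

```python
from sklearn.datasets import load_wine

# Load the dataset directly as a DataFrame (features plus a 'target' column)
wines_frame = load_wine(as_frame=True).frame

# Pairwise Pearson correlations between the 13 features
corr = wines_frame.drop(columns="target").corr()

# Correlation between the two features highlighted in the text
pair_corr = corr.loc["total_phenols", "flavanoids"]
print(f"Pearson r (total_phenols, flavanoids) = {pair_corr:.2f}")
```

A value close to 1 supports what the scatterplot suggests, and the full corr table can be scanned the same way for other strongly related feature pairs.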

3. Preprocessing & Scaling

We've gained a good grasp of what our data looks like, and reaching this point typically signals that we're ready to begin preparing the data for use in a machine learning model.

Data preprocessing stands as a crucial stage in the machine learning workflow because real-world data is often quite messy. It might exhibit various issues, including

missing values,
redundant entries,
outliers,
errors,
and noise.

Addressing these concerns is essential before feeding the data to a machine learning model. Otherwise, the model could inadvertently learn from these issues and make mistakes when presented with new data – this is encapsulated in the famous adage, "Garbage in, garbage out."

Apart from our data having different scales, there don't appear to be any major issues upon initial inspection.

In machine learning, when it comes to scaling the dataset, it is generally recommended to split the dataset into a training set and a test (or validation) set before applying any scaling transformations. The reason for this is to prevent data leakage and ensure that your model generalizes well to unseen data. The following steps are typically followed:

Split the dataset
Scale and transform the training set
Transform the test set

To tackle the scaling problem, we will employ sklearn's StandardScaler class to standardize the features. This process will ensure that the mean of each feature is centered around zero, and the variance is set to 1.
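As a rough sketch of what standardization computes, each value x is replaced by z = (x - mean) / std, where the mean and standard deviation come from the training data only. The arrays below are hypothetical, and NumPy is used in place of StandardScaler purely to expose the arithmetic.

```python
import numpy as np

# Hypothetical training and test values for a single feature
train = np.array([[100.0], [110.0], [120.0], [130.0]])
test = np.array([[115.0], [140.0]])

# Statistics are estimated on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and variance 1 by construction, while the test column generally does not, since it is transformed with the training statistics rather than its own.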

3.1. Split the dataset

Divide your dataset into a training set and a test set (or a validation set). The training set is used to train your machine learning model, and the test set is used to evaluate its performance.

from sklearn.model_selection import train_test_split
# Split data into features and label
features = wines['feature_names'] 
X = wines_df[features] 
y = wines_df["target"]

#Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=.7, 
                                                    random_state=42)

# Check the sizes
print(f"Original Dataset dimensions: {X.shape}\n")
print(f"Train set dimensions: {X_train.shape}")
print(f"Train set size percentage: {round(len(X_train) / len(X) * 100)}%\n")
print(f"Test set dimensions: {X_test.shape}")
print(f"Test set size percentage: {round(len(X_test) / len(X) * 100)}%")

Output:

Original Dataset dimensions: (178, 13)

Train set dimensions: (124, 13)
Train set size percentage: 70%

Test set dimensions: (54, 13)
Test set size percentage: 30%

In the split phase, we haven't specified the stratify argument, so it keeps its default of None and the split is purely random: the ratio of labels in the train and test sets may differ from the ratio in the original dataset. Passing stratify=y preserves the same ratio of labels in both sets. Try it and see whether it impacts your results!
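The effect of stratify is easiest to see on imbalanced labels. The sketch below uses a hypothetical dataset of 100 samples with a 60/30/10 class split (not the wine data) and compares the class counts in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% class 0, 30% class 1, 10% class 2
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)

# Purely random split (stratify defaults to None)
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Stratified split: class proportions are preserved in both subsets
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print("without stratify:", np.bincount(y_test_plain, minlength=3))
print("with stratify:   ", np.bincount(y_test_strat, minlength=3))
```

With stratification the 30 test samples keep the original 60/30/10 proportions, whereas the random split can drift away from them, which matters most for small or imbalanced datasets.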

3.2. Scale the training set

Apply the scaling transformations (e.g., mean-centering and standardization) to the features in your training set. This helps ensure that the data has a consistent scale and that your machine learning algorithms perform better. For this tutorial, we will be using StandardScaler which will ensure that each feature will have the mean value of 0 and the variance of 1, bringing all features to the same magnitude.

from sklearn.preprocessing import StandardScaler

# Instantiate scaler and fit on features
scaler = StandardScaler()
scaler.fit(X_train)

# Transform features in the training set
X_train_scaled = scaler.transform(X_train)

# # fit and transform can also be done at the same time using fit_transform
# X_train_scaled = scaler.fit_transform(X_train)

# Convert X_train_scaled to a dataframe
features = wines['feature_names']
X_train_scaled_df = pd.DataFrame(data=X_train_scaled, columns=features)

# Display first 5 rows
X_train_scaled_df.head().T

Note that I transposed the dataframe (swapping rows and columns) for better visualization, i.e. to fit all features on the screen. The magnesium levels now range from -0.841477 to 2.294697 in the first five rows.

3.3. Transform the test set

Use the scaling parameters (such as the mean and standard deviation) calculated from the training set to transform the test set. Do not recompute the scaling parameters using the test set, as this can introduce data leakage.

# Transform features in the test set
X_test_scaled = scaler.transform(X_test)

# Convert X_test_scaled to a dataframe
X_test_scaled_df = pd.DataFrame(data=X_test_scaled, columns=features)

# Display first 5 rows
X_test_scaled_df.head().T

We are now ready to build our model.

4. Building the Model

Now, we'll build our kNN model. We will first use a value of 3 for 'k', with the intention of refining it later. However before even doing that, we need to instantiate the kNN model. We will then train it with our training data, by providing it with both the features and the target variable to enable the model to acquire the necessary information, hence the name 'supervised learning'.

from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the kNN model
knn = KNeighborsClassifier(n_neighbors=3)

# Build the model
knn.fit(X_train_scaled,y_train)

# Display model parameters
knn.get_params()

Output:

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}

Minkowski Distance:

A generalization of both Euclidean and Manhattan distances.
It includes a parameter (p) that allows you to choose the distance metric. When p=2, it is equivalent to Euclidean distance, and when p=1, it is equivalent to Manhattan distance.
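For reference, the Minkowski distance between two points a and b is (sum over i of |a_i - b_i|^p)^(1/p). The helper below is a minimal illustrative sketch of that formula, not Scikit-Learn's implementation; the function name is an assumption.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length sequences, for p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a, point_b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Changing only p switches between the two familiar metrics, which is why KNeighborsClassifier exposes metric='minkowski' together with the p parameter.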

You can check the kNN Parameters page for more information on all the parameters used in the algorithm.

5. Make Prediction & Calculate Accuracy

With our model in place, we'll use it to make predictions and evaluate its accuracy. We'll also discuss the concept of accuracy as an evaluation metric.

from sklearn.metrics import accuracy_score

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# or using KNeighborsClassifier's 'score' method
# knn.score(X_test_scaled,y_test)
# or using numpy's mean: 
# accuracy = np.mean(np.array(y_test)==y_pred)

print("Accuracy:", accuracy)

# Output
# Accuracy: 0.9629629629629629

The accuracy on the test set, i.e. on unseen data, is about 0.963. This means that with k = 3 the model predicted the correct class for 96.3% of the samples in the test set.

6. Finding the Best Value for k (Hyperparameter Tuning)

k, the number of nearest neighbors, is a crucial hyperparameter in kNN. We'll explore methods for selecting the best value for k:

6.1. Selecting multiple k values

We'll try different values of k and evaluate their impact on model performance.

# Select k values
k_max=15
k_values = list(range(1,k_max+1))
accuracies_k = {}

for k in k_values:
    # Create an instance of the kNN model
    knn = KNeighborsClassifier(n_neighbors=k)

    # Build the model
    knn.fit(X_train_scaled,y_train)

    # Make predictions
    y_pred = knn.predict(X_test_scaled)

    # Calculate accuracy
    accuracy = np.mean(y_pred==y_test)

    # Insert into dict
    accuracies_k[k]=accuracy

# Plot Data
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = list(accuracies_k.keys()), y = list(accuracies_k.values()), marker = 'o', ax=ax)
ax.set_title("k vs Accuracy")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.9,1.025)
ax.set_xlim(0,k_max+1)

# Annotate data points
for k,v in accuracies_k.items():
    ax.annotate(text=round(v,3),
                xy=(k,v),
                textcoords='offset points',
                xytext=(1,3.5),
                ha='center')

We tried 15 values of k, ranging from 1 to 15, and the best accuracy we obtained is 0.981, reached at six different values: 7, 8, 9, 12, 14, and 15. By inspecting the plot or printing the accuracies_k dictionary, we can easily select k = 7, the first value with the maximum accuracy. However, if we were to try tens or hundreds (or more) of k values, this would become tedious. The code below finds the k value with the highest accuracy in two different ways:

# Best k value
best_k = max(accuracies_k, key=accuracies_k.get)

# or
best_k = 0
best_acc = 0
for k,v in accuracies_k.items():
    if v > best_acc:
        best_k=k
        best_acc=v
        
knn_accuracy = round(best_acc,5) # will be used later

# Print the best k with highest accuracy
print(best_k, best_acc)

Output: 
7 0.9814814814814815

Another accuracy we should consider is the training set accuracy, which helps us better understand the model's ability to generalize. The code below shows both the training and test set accuracies for different k values.

training_accuracy = []
test_accuracy = []
k_max = 15
number_of_neighbors = range(1,k_max+1)

for k in number_of_neighbors:
    # build the model
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train_scaled, y_train)

    # store training set accuracy
    training_accuracy.append(knn.score(X_train_scaled, y_train))
    # store test set accuracy
    test_accuracy.append(knn.score(X_test_scaled, y_test))

fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = number_of_neighbors, y = training_accuracy, marker = 'o', ax=ax, label = 'training accuracy')
sns.lineplot(x = number_of_neighbors, y = test_accuracy, marker = 'o', ax=ax, label = 'test accuracy')

ax.set_title("k vs Accuracy")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.94,1.01)
ax.set_xlim(0,k_max+1)
ax.legend()

plt.show()

The plot shows that when k=1, i.e. using only one neighbor, the prediction on the training set is perfect, because every training point is its own nearest neighbor. As more neighbors are considered, the model becomes simpler: the training accuracy drops, but the test accuracy increases. However, at around 10 neighbors the model becomes too simple and performance decreases. The best performance occurs at k = 7. Keep in mind that even the worst performance is above 94% accuracy, which may be acceptable for many applications.

6.2. Using Cross Validation

Cross-validation is a statistical method for evaluating a model's generalization performance, and it is more stable and thorough than a single split into training and test sets. In the code snippet below, we take a range of k values and set up an empty dictionary to store the outcomes. We employ cross-validation to calculate accuracy scores, eliminating the need for creating a training and test split. Nevertheless, we must ensure our data is properly scaled. We then iterate through the 'k' values and store the corresponding mean score for each.

For implementing cross-validation, we make use of scikit-learn's cross_val_score function. We provide an instance of the kNN model, our dataset, and specify the number of splits to perform. In the code below, we opt for five splits, which means the data is divided into five equal-sized groups, with four groups used for training and one for testing in each iteration. The model cycles through each group, generating an accuracy score for each, which we subsequently average to determine the best model.

from sklearn.model_selection import cross_val_score

k_max = 15
k_values = [i for i in range (1,k_max+1)]
scores = {}

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # cv:Determines the cross-validation splitting strategy, default=None to use the default 5-fold cross validation
    score = cross_val_score(knn, X, y, cv=None) 
    scores[k] = np.mean(score)

# Plot Data
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = list(scores.keys()), y = list(scores.values()), marker = 'o', ax = ax)
ax.set_title("k vs Accuracy (using Cross-Validation)")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.9,1.025)
ax.set_xlim(0,k_max+1)

# Annotate data points
for k,v in scores.items():
    ax.annotate(text=round(v,3),
                xy=(k,v),
                textcoords='offset points',
                xytext=(1,3.5),
                ha='center')

The highest accuracy we can obtain is 0.967, which is attained with either 7 or 8 nearest neighbors. Therefore, we will take 7 as our best value for k, and can conclude that with k=7 our model is expected to be around 97% accurate on average.
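Note that the loop above scales the full dataset once before cross-validating, so each validation fold contributes to the scaler's statistics. One possible way to keep scaling inside each fold, in line with the split-then-scale advice from section 3, is to wrap the scaler and the classifier in a Scikit-Learn pipeline. This is a sketch with k fixed at 7 and illustrative variable names; its scores may differ slightly from the ones reported above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Unscaled features and labels
X_raw, y_raw = load_wine(return_X_y=True)

# Within each CV split, the scaler is fit on the training folds only
# and then applied to the held-out fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

fold_scores = cross_val_score(pipeline, X_raw, y_raw, cv=5)
print(fold_scores)
print(fold_scores.mean())
```

The same pipeline object can be passed wherever an estimator is expected, so the scaling and the model are always evaluated together.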

6.2.1. Grid Search Cross Validation - GridSearchCV

We can also combine the search over k values and cross-validation in a single step using sklearn.model_selection.GridSearchCV, which yields cleaner code:

from sklearn.model_selection import GridSearchCV

# Define the grid
param_grid = {'n_neighbors': np.arange(1,50)}

# Build the model using GridSearchCV
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X , y)

# Display best hyperparameter value along with its mean cross validation score
print(knn_cv.best_params_)
print(knn_cv.best_score_)

Output:

{'n_neighbors': 7}
0.9665079365079364

7. Other Evaluation Metrics

7.1. Accuracy vs Precision vs Recall

Accuracy is not the only evaluation metric. We also need to check the precision and recall metrics to understand how well our model identifies each class in unseen data.

Precision: Precision is the ratio of correctly predicted positive instances to the total instances predicted as positive. In other words, it measures the accuracy of the positive predictions made by the model. A higher precision indicates fewer false positives.

Precision = TP / (TP + FP)

Where:

TP (True Positives) is the number of instances correctly predicted as positive.
FP (False Positives) is the number of instances incorrectly predicted as positive (i.e., negative instances that were predicted as positive)

Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive instances to the total actual positive instances in the dataset. It measures the model's ability to find all positive instances. A higher recall indicates fewer false negatives.

Recall = TP / (TP + FN)

Where:

TP (True Positives) is the number of instances correctly predicted as positive.
FN (False Negatives) is the number of instances incorrectly predicted as negative (i.e., positive instances that were predicted as negative).
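The two formulas can be checked by hand on a small binary example. The labels below are hypothetical, and the helper is an illustrative sketch rather than a replacement for Scikit-Learn's precision_score and recall_score (it does not handle the case where a denominator is zero).

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels: 4 actual positives, 3 predicted positives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true_demo, y_pred_demo)
print(precision, recall)
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3 while recall is 1/2: the model is fairly reliable when it predicts positive, but it finds only half of the actual positives.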

Let's re-build our model with a k value of 7, and calculate the scores.

from sklearn.metrics import precision_score, recall_score

# Build the model
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Calculate scores
acc_score = accuracy_score(y_test,y_pred)
prec_score = precision_score(y_test,y_pred, average='macro')  # average : {'micro', 'macro', 'samples', 'weighted', 'binary'}
rec_score = recall_score(y_test,y_pred, average='macro')

print("Accuracy:", acc_score)
print("Precision:", prec_score)
print("Recall:", rec_score)

Output:

Accuracy: 0.9814814814814815
Precision: 0.9833333333333334
Recall: 0.9841269841269842

7.2. Confusion Matrix

A confusion matrix is a fundamental tool for evaluating the performance of classification models and provides a concise summary of how well a model's predictions match the actual outcomes, making it an essential component of model evaluation.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=knn.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=knn.classes_)
disp.plot()
plt.show()

According to our confusion matrix, our kNN model misclassified only one data point: a sample that belongs to class 1 was predicted as class 0. The precision and recall values per class are as follows:

Precision = TP / (TP+FP)
Recall = TP / (TP+FN)

Precision_0 = 19 / (19+1) = 19/20 = 0.95
Recall_0 = 19 / (19+0) = 1.00

Precision_1 = 20 / (20+0) = 1.00
Recall_1 = 20 / (20+1) = 0.95

Precision_2 = 14 / (14+0) = 1.00
Recall_2 = 14 / (14+0) = 1.00
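The same per-class values can be derived directly from a confusion matrix. The sketch below hard-codes the matrix described in the text (rows are actual classes, columns are predicted classes, following Scikit-Learn's convention) instead of recomputing it, so the array cm_demo is an assumption based on that description.

```python
import numpy as np

# Rows = actual class, columns = predicted class
# One class-1 sample was predicted as class 0
cm_demo = np.array([[19, 0, 0],
                    [1, 20, 0],
                    [0, 0, 14]])

true_positives = np.diag(cm_demo)

# Column sums = TP + FP (everything predicted as that class)
precision_per_class = true_positives / cm_demo.sum(axis=0)

# Row sums = TP + FN (everything that actually belongs to that class)
recall_per_class = true_positives / cm_demo.sum(axis=1)

print(precision_per_class.round(2))
print(recall_per_class.round(2))
print(precision_per_class.mean(), recall_per_class.mean())
```

Averaging the per-class values with equal weight gives the macro-averaged precision and recall printed in section 7.1.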

7.3. Classification Reports

A classification report provides several important metrics for evaluating the performance of a classification model. These metrics include:

Precision: measures the accuracy of positive predictions.
Recall: measures the model's ability to find all positive instances.
F1-Score: balances precision and recall.
Support: indicates the number of instances in each class.
Accuracy: measures the overall correctness of predictions.
Macro Avg: calculates unweighted averages of class-specific metrics and treats all classes equally.
Weighted Avg: calculates class-specific metrics with a weighted average based on the number of instances in each class.
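As a small worked example of how these metrics relate, the F1-score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The sketch below plugs in the per-class precision, recall, and support values from the kNN confusion matrix in section 7.2; the variable names are illustrative.

```python
# Per-class precision, recall, and support from the kNN results in section 7.2
precisions = [19 / 20, 1.0, 1.0]
recalls = [1.0, 20 / 21, 1.0]
supports = [19, 21, 14]

# F1 is the harmonic mean of precision and recall
f1_scores = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

# Macro average: every class counts equally
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print([round(f, 2) for f in f1_scores])
print(round(macro_f1, 2), round(weighted_f1, 2))
```

Because the classes here have similar supports and similar scores, the macro and weighted averages end up nearly identical; they diverge when a small class performs much worse than the large ones.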

We generate classification reports using scikit-learn's classification_report function:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=wines['target_names']))

The classification report summarizes our earlier findings in terms of overall accuracy, and precision and recall values per class.

8. Classification with Other Algorithms for Comparison

To provide a holistic view of classification, we'll compare kNN with other classification methods:

Logistic Regression
Support Vector Machines
Decision Trees

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


# Instantiate the models
log_reg = LogisticRegression()
svm = SVC()
dt = DecisionTreeClassifier()

# Build models
log_reg.fit(X_train_scaled, y_train)
svm.fit(X_train_scaled, y_train)
dt.fit(X_train_scaled, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test_scaled)
svm_pred = svm.predict(X_test_scaled)
dt_pred = dt.predict(X_test_scaled)


# Check accuracies
log_reg_accuracy = round(accuracy_score(y_test, log_reg_pred),5)
svm_accuracy = round(accuracy_score(y_test, svm_pred),5)
dt_accuracy = round(accuracy_score(y_test, dt_pred),5)
print(f"Accuracies:\n\
=================================\n\
 k-Nearest Neighbor:      {knn_accuracy}\n\
 Logistic Regression:     {log_reg_accuracy}\n\
 Support Vector Machines: {svm_accuracy}\n\
 Decision Tree:           {dt_accuracy}\n\
=================================\n")

Output:

Accuracies:
=================================
 k-Nearest Neighbor:      0.98148
 Logistic Regression:     0.98148
 Support Vector Machines: 0.98148
 Decision Tree:           0.96296
=================================

8.1. Evaluations using Classification Reports

Time to compare models' performances using classification reports of kNN, Logistic Regression, Support Vector Machines, and Decision Tree Classifier. This will allow us to compare their performance and understand the pros and cons of each method.

from sklearn.metrics import classification_report

# create a dictionary  for model predictions
model_predictions = {
    "k-Nearest Neighbor":y_pred,
    "Logistic Regression": log_reg_pred,
    "Support Vector Machines": svm_pred,
    "Decision Trees": dt_pred
}

for model, pred in model_predictions.items():
    print(f"{model} \nResults:\n{classification_report(y_test, pred)}\
\n-----------------------------------------------------\n")

Output:

k-Nearest Neighbor 
Results:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.95      0.98        21
           2       1.00      1.00      1.00        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Logistic Regression 
Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.95      0.98        21
           2       0.93      1.00      0.97        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Support Vector Machines 
Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.95      1.00      0.98        21
           2       1.00      0.93      0.96        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Decision Trees 
Results:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97        19
           1       0.91      1.00      0.95        21
           2       1.00      0.93      0.96        14

    accuracy                           0.96        54
   macro avg       0.97      0.96      0.96        54
weighted avg       0.97      0.96      0.96        54

-----------------------------------------------------

Note: For more detailed information on each parameter, you can check the Classification Report page.

From the output, it appears that kNN, Logistic Regression, and Support Vector Machines models perform equally well, giving us the flexibility to select any of them. However, it's essential to have a deeper understanding of these models and the knowledge they acquire during the learning process. This deeper insight will provide us with a clearer understanding of their respective strengths and weaknesses. Having this knowledge is immensely valuable to stakeholders, as it empowers them to devise solutions to address areas where the model may have limitations.

Conclusion

With the inspection of the classification reports, we have concluded our first kNN classification task. In this tutorial, we used one of Scikit-Learn's datasets, the wine dataset; explored how to use Scikit-Learn to preprocess datasets, scaling in particular; implemented the kNN algorithm to classify the dataset; and learned how to fine-tune kNN's hyperparameter k for optimal performance.

Last updated 9 months ago

Was this helpful?

k-Nearest Neighbors

Key points to remember about kNN:

It's a non-parametric and lazy learning algorithm, meaning it doesn't make strong assumptions about the data's underlying distribution.
The choice of the k hyperparameter can significantly impact the model's performance and must be carefully selected through validation techniques.
kNN is sensitive to the choice of distance metric and feature scaling, so preprocessing the data is crucial.
It's often used for tasks such as recommendation systems, image classification, and anomaly detection, among others.

Here are the Initial packages we will be working with. Note that I will be adding necessary scikit learn packages as needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. Import Data

The first step in any machine learning project is to import the dataset. In this tutorial, we'll use the load_wine dataset from Scikit-Learn, which contains information about different wines.

from sklearn.datasets import load_wine 

# Instantiate data object, which returns dictionary-like object
wines = load_wine()

# Display keys
wines.keys()

# Output:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

Display metadata, the first 650 characters, and the target name, i.e. classes:

print(wines['DESCR'][:650] + '\n...')

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
          
...

wines.target_names

# Output
array(['class_0', 'class_1', 'class_2'], dtype='<U7')

Convert data into a pandas Dataframe

# Convert data to pandas dataframe
wines_df = pd.DataFrame(data=wines['data'], columns=wines['feature_names'])

# Add the target label
wines_df["target"] = wines['target']

# Display first 5 rows
wines_df.head()

2. Exploratory Data Analysis

Before diving into modeling, it's crucial to understand your data. We'll explore the dataset to gain insights into its structure, attributes, and any potential challenges.

wines_df.info()

The output of wines_df.info() shows that:

There are 178 samples and 14 columns, including the target (the column we want to predict).
There are no missing values in any of the columns.
All features are float64, and the target is int64.
The dataset's memory usage is 19.6 KB.

Once we have the generic info about the dataset, we can also explore the descriptive statistics about each feature in the dataset:

# Get descriptive statistics about each feature in the dataset
wines_df.describe().T

Pandas' describe() method shows the range of each feature in the dataset. The features sit on very different scales, and because kNN is sensitive to feature scaling, we need to scale the dataset before building our model.

# create scatterplot matrix
ax = pd.plotting.scatter_matrix(wines_df.iloc[:,:-1], 
                                c = wines['target'],  #color
                                figsize=(15,15), 
                                marker='*',
                                alpha=0.5
                               )

# Update axis labels
for i in ax.flatten():
    # update rotation of labels
    i.xaxis.label.set_rotation(90)
    i.yaxis.label.set_rotation(0)

    # align y axis label to right
    i.yaxis.label.set_ha('right')

    # update size of labels
    i.yaxis.label.set_fontsize(12)
    i.xaxis.label.set_fontsize(12)

# plot the graph
plt.show()

3. Preprocessing & Scaling

We've gained a good grasp of what our data looks like, and reaching this point typically signals that we're ready to begin preparing the data for use in a machine learning model.

Data preprocessing stands as a crucial stage in the machine learning workflow because real-world data is often quite messy. It might exhibit various issues, including

missing values,
redundant entries,
outliers,
errors,
and noise.

Apart from our data having different scales, there don't appear to be any major issues upon initial inspection. We'll prepare the data in three steps:

Split the dataset
Scale and transform the training set
Transform the test set

3.1. Split the dataset

Divide your dataset into a training set and a test set (or a validation set). The training set is used to train your machine learning model, and the test set is used to evaluate its performance.

from sklearn.model_selection import train_test_split
# Split data into features and label
features = wines['feature_names'] 
X = wines_df[features] 
y = wines_df["target"]

#Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=.7, 
                                                    random_state=42)

# Check the sizes
print(f"Original Dataset dimensions: {X.shape}\n\n"
      f"Train set dimensions: {X_train.shape}\n"
      f"Train set size percentage: {round(len(X_train) / len(X) * 100)}%\n\n"
      f"Test set dimensions: {X_test.shape}\n"
      f"Test set size percentage: {round(len(X_test) / len(X) * 100)}%\n")

Output:

Original Dataset dimensions: (178, 13)

Train set dimensions: (124, 13)
Train set size percentage: 70%

Test set dimensions: (54, 13)
Test set size percentage: 30%

3.2. Scale the training set

from sklearn.preprocessing import StandardScaler

# Instantiate scaler and fit on features
scaler = StandardScaler()
scaler.fit(X_train)

# Transform features in the training set
X_train_scaled = scaler.transform(X_train)

# Fitting and transforming can also be done in one step using fit_transform:
# X_train_scaled = scaler.fit_transform(X_train)

# Convert X_train_scaled to a dataframe
features = wines['feature_names']
X_train_scaled_df = pd.DataFrame(data=X_train_scaled, columns=features)

# Display first 5 rows
X_train_scaled_df.head().T

3.3. Transform the test set

# Transform features in the test set
X_test_scaled = scaler.transform(X_test)

# Convert X_test_scaled to a dataframe
X_test_scaled_df = pd.DataFrame(data=X_test_scaled, columns=features)

# Display first 5 rows
X_test_scaled_df.head().T
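
To make the scaling step concrete, here is a minimal sketch of what standardization does to a single feature, using only the Python standard library. The sample values and variable names are made up for illustration. StandardScaler performs the equivalent computation per column using the population standard deviation. The important point is that the mean and standard deviation are learned from the training values only and then reused for the test values.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature (not taken from the wine dataset)
train_values = [12.0, 13.0, 14.0, 15.0]
test_values = [13.5, 16.0]

# "fit": learn the statistics from the training set only
mu = fmean(train_values)
sigma = pstdev(train_values)  # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally do not, because they are expressed relative to the training statistics.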

We are now ready to build our model.

4. Building the Model

from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the kNN model
knn = KNeighborsClassifier(n_neighbors=3)

# Build the model
knn.fit(X_train_scaled,y_train)

# Display model parameters
knn.get_params()

Output:

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}

Minkowski Distance:

A generalization of both Euclidean and Manhattan distances.
It includes a parameter (p) that allows you to choose the distance metric. When p=2, it is equivalent to Euclidean distance, and when p=1, it is equivalent to Manhattan distance.
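
As a small illustration of the role of p, the sketch below computes the Minkowski distance between two points in plain Python. The function name minkowski_distance and the sample points are chosen for this example only and are not part of scikit-learn.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

With p=1 the coordinate differences are simply added, while p=2 gives the straight-line distance, so the same pair of points is farther apart under the Manhattan metric.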

See the kNN Parameters page for more information on all the parameters used in the algorithm.

5. Make Prediction & Calculate Accuracy

With our model in place, we'll use it to make predictions on the test set and evaluate its accuracy, i.e. the fraction of test samples whose predicted class matches the true class.

from sklearn.metrics import accuracy_score

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# or using KNeighborsClassifier's 'score' method
# knn.score(X_test_scaled,y_test)
# or using numpy's mean: 
# accuracy = np.mean(np.array(y_test)==y_pred)

print("Accuracy:", accuracy)

# Output
# Accuracy: 0.9629629629629629
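
To show what a kNN prediction amounts to for a single query point, here is a minimal from-scratch sketch using Euclidean distance and a majority vote. The function knn_predict and the toy data are hypothetical and only illustrate the idea. scikit-learn's KNeighborsClassifier uses more efficient neighbor search and may break ties differently.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    # Collect the labels of the k closest points and take the most common one
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny made-up dataset with two well-separated groups
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (4.9, 5.0), k=3))
```

Note that nothing is learned ahead of time: all the work happens at prediction time, which is why kNN is called a lazy learner.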

6. Finding the Best Value for k (Hyperparameter Tuning)

k, the number of nearest neighbors, is a crucial hyperparameter in kNN. We'll explore methods for selecting the best value for k:

6.1. Selecting multiple k values

We'll try different values of k and evaluate their impact on model performance.

# Select k values
k_max=15
k_values = list(range(1,k_max+1))
accuracies_k = {}

for k in k_values:
    # Create an instance of the kNN model
    knn = KNeighborsClassifier(n_neighbors=k)

    # Build the model
    knn.fit(X_train_scaled,y_train)

    # Make predictions
    y_pred = knn.predict(X_test_scaled)

    # Calculate accuracy
    accuracy = np.mean(y_pred==y_test)

    # Insert into dict
    accuracies_k[k]=accuracy

# Plot Data
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = accuracies_k.keys(), y = accuracies_k.values(), marker = 'o', ax=ax)
ax.set_title("k vs Accuracy")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.9,1.025)
ax.set_xlim(0,k_max+1)

# Annotate data points
for k,v in accuracies_k.items():
    ax.annotate(text=round(v,3),
                xy=(k,v),
                textcoords='offset points',
                xytext=(1,3.5),
                ha='center')

# Best k value
best_k = max(accuracies_k, key=accuracies_k.get)

# or
best_k = 0
best_acc = 0
for k,v in accuracies_k.items():
    if v > best_acc:
        best_k=k
        best_acc=v
        
knn_accuracy = round(best_acc,5) # will be used later

# Print the best k with highest accuracy
print(best_k, best_acc)

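Output:
7 0.9814814814814815

Note that this approach picks k using the test set, so the reported accuracy is an optimistic estimate; cross-validation (section 6.2) avoids this. We can also plot training and test accuracy together to see how each changes with k: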

training_accuracy = []
test_accuracy = []
k_max = 15
number_of_neighbors = range(1,k_max+1)

for k in number_of_neighbors:
    # build the model
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train_scaled, y_train)

    # store training set accuracy
    training_accuracy.append(knn.score(X_train_scaled, y_train))
    # store test set accuracy
    test_accuracy.append(knn.score(X_test_scaled, y_test))

fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = number_of_neighbors, y = training_accuracy, marker = 'o', ax=ax, label = 'training accuracy')
sns.lineplot(x = number_of_neighbors, y = test_accuracy, marker = 'o', ax=ax, label = 'test accuracy')

ax.set_title("k vs Accuracy")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.94,1.01)
ax.set_xlim(0,k_max+1)
ax.legend()

plt.show()

6.2. Using Cross Validation

from sklearn.model_selection import cross_val_score

k_max = 15
k_values = [i for i in range (1,k_max+1)]
scores = {}

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # cv:Determines the cross-validation splitting strategy, default=None to use the default 5-fold cross validation
    score = cross_val_score(knn, X, y, cv=None) 
    scores[k] = np.mean(score)

# Plot Data
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(x = scores.keys(), y = scores.values(), marker = 'o', ax = ax)
ax.set_title("k vs Accuracy (using Cross-Validation)")
ax.set_xlabel("k Values")
ax.set_ylabel("Accuracy Score")
ax.set_ylim(0.9,1.025)
ax.set_xlim(0,k_max+1)

# Annotate data points
for k,v in scores.items():
    ax.annotate(text=round(v,3),
                xy=(k,v),
                textcoords='offset points',
                xytext=(1,3.5),
                ha='center')
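
The loop above scales the full dataset once before cross-validating, so every validation fold has already influenced the scaler's mean and standard deviation. A possible alternative, sketched below, wraps the scaler and the classifier in a scikit-learn pipeline so that scaling is re-fit on the training portion of each fold. The names X_raw, y_raw, pipeline_scores and best_k_cv are introduced for this example, and the resulting scores may differ slightly from those plotted above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reload the raw (unscaled) features so the example is self-contained
X_raw, y_raw = load_wine(return_X_y=True)

pipeline_scores = {}
for k in range(1, 16):
    # The scaler is re-fit on the training portion of every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipeline_scores[k] = cross_val_score(model, X_raw, y_raw, cv=5).mean()

# k with the highest mean cross-validation accuracy
best_k_cv = max(pipeline_scores, key=pipeline_scores.get)
print(best_k_cv, round(pipeline_scores[best_k_cv], 4))
```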

6.2.1. Grid Search Cross Validation - GridSearchCV

We can also combine both methods (trying multiple k values and cross-validating each one) using sklearn.model_selection.GridSearchCV, which makes the code cleaner:

from sklearn.model_selection import GridSearchCV

# Define the grid
param_grid = {'n_neighbors': np.arange(1,50)}

# Build the model using GridSearchCV
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X , y)

# Display best hyperparameter value along with its mean cross validation score
print(knn_cv.best_params_)
print(knn_cv.best_score_)

Output:

{'n_neighbors': 7}
0.9665079365079364
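
GridSearchCV can also search over more than one hyperparameter at a time and can take a whole pipeline as its estimator. The sketch below is one way to do that, assuming we also want to tune the weights parameter shown earlier in get_params(). The step names "scaler" and "knn" and the variables X_raw, y_raw, pipe and search are choices made for this example, and the best parameters it finds are not guaranteed to match the output above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Raw (unscaled) features; scaling happens inside the pipeline
X_raw, y_raw = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "knn__n_neighbors": list(range(1, 16)),
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_raw, y_raw)

print(search.best_params_)
print(round(search.best_score_, 4))
```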

7. Other Evaluation Metrics

7.1. Accuracy vs Precision vs Recall

Accuracy is not the only evaluation metric. We also need to check precision and recall to understand how well our model identifies each class in unseen data.

Precision = TP / (TP + FP)

Where:

TP (True Positives) is the number of instances correctly predicted as positive.
FP (False Positives) is the number of instances incorrectly predicted as positive (i.e., negative instances that were predicted as positive)

Recall = TP / (TP + FN)

Where:

TP (True Positives) is the number of instances correctly predicted as positive.
FN (False Negatives) is the number of instances incorrectly predicted as negative (i.e., positive instances that were predicted as negative).

Let's re-build our model with a k value of 7, and calculate the scores.

from sklearn.metrics import precision_score, recall_score

# Build the model
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Calculate scores
acc_score = accuracy_score(y_test,y_pred)
prec_score = precision_score(y_test,y_pred, average='macro')  # average : {'micro', 'macro', 'samples', 'weighted', 'binary'}
rec_score = recall_score(y_test,y_pred, average='macro')

print("Accuracy:", acc_score)
print("Precision:", prec_score)
print("Recall:", rec_score)

Output:

Accuracy: 0.9814814814814815
Precision: 0.9833333333333334
Recall: 0.9841269841269842

7.2. Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=knn.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=knn.classes_)
disp.plot()
plt.show()

According to the confusion matrix, our kNN model misclassified only one data point: a sample that belongs to class 1 was predicted as class 0. The precision and recall values per class are as follows:

Precision = TP / (TP+FP)
Recall = TP / (TP+FN)

Precision_0 = 19 / (19+1) = 19/20 = 0.95
Recall_0 = 19 / (19+0) = 1.00

Precision_1 = 20 / (20+0) = 1.00
Recall_1 = 20 / (20+1) = 20/21 = 0.95

Precision_2 = 14 / (14+0) = 1.00
Recall_2 = 14 / (14+0) = 1.00
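
The same per-class values can be derived mechanically from the confusion matrix. The sketch below hard-codes a matrix reconstructed from the description in this section (one class 1 sample predicted as class 0, with 19, 21 and 14 test samples per class), so treat the counts as an assumption rather than as output copied from the plot. The variable names are chosen for this example.

```python
# Confusion matrix reconstructed from the description in this section:
# rows are true classes, columns are predicted classes
cm_counts = [
    [19, 0, 0],
    [1, 20, 0],
    [0, 0, 14],
]

n_classes = len(cm_counts)
precision = []
recall = []
for c in range(n_classes):
    tp = cm_counts[c][c]
    # Predicted as c but actually another class
    fp = sum(cm_counts[r][c] for r in range(n_classes)) - tp
    # Actually c but predicted as another class
    fn = sum(cm_counts[c]) - tp
    precision.append(tp / (tp + fp))
    recall.append(tp / (tp + fn))

# Macro average: unweighted mean over the classes
macro_precision = sum(precision) / n_classes
macro_recall = sum(recall) / n_classes

print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print(round(macro_precision, 4), round(macro_recall, 4))
```

The macro averages computed this way agree with the precision and recall printed in section 7.1, which were calculated with average='macro'.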

7.3. Classification Reports

A classification report provides several important metrics for evaluating the performance of a classification model. These metrics include:

Precision: measures the accuracy of positive predictions.
Recall: measures the model's ability to find all positive instances.
F1-Score: balances precision and recall.
Support: indicates the number of instances in each class.
Accuracy: measures the overall correctness of predictions.
Macro Avg: calculates unweighted averages of class-specific metrics and treats all classes equally.
Weighted Avg: calculates class-specific metrics with a weighted average based on the number of instances in each class.
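
To make the F1-score and the two averages concrete, here is a short sketch that computes them by hand from the per-class precision, recall and support of the kNN model in this tutorial. The variable names are chosen for this example, and the F1 formula used is the standard harmonic mean of precision and recall.

```python
# Per-class values for the kNN model (classes 0, 1, 2)
precision = [19 / 20, 1.0, 1.0]
recall = [1.0, 20 / 21, 1.0]
support = [19, 21, 14]

# F1 is the harmonic mean of precision and recall
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro average treats every class equally
macro_f1 = sum(f1) / len(f1)

# Weighted average weights each class by its number of samples
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print([round(f, 2) for f in f1])
print(round(macro_f1, 2), round(weighted_f1, 2))
```

Because the three classes have similar support here, the macro and weighted averages are nearly identical. They diverge more on strongly imbalanced datasets.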

We generate classification reports using scikit-learn's classification_report function:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=wines['target_names']))

The classification report summarizes our earlier findings in terms of overall accuracy and per-class precision and recall.

8. Classification with Other Algorithms for Comparison

To provide a holistic view of classification, we'll compare kNN with other classification methods:

Logistic Regression
Support Vector Machines
Decision Trees

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


# Instantiate the models
log_reg = LogisticRegression()
svm = SVC()
dt = DecisionTreeClassifier()

# Build models
log_reg.fit(X_train_scaled, y_train)
svm.fit(X_train_scaled, y_train)
dt.fit(X_train_scaled, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test_scaled)
svm_pred = svm.predict(X_test_scaled)
dt_pred = dt.predict(X_test_scaled)


# Check accuracies
log_reg_accuracy = round(accuracy_score(y_test, log_reg_pred),5)
svm_accuracy = round(accuracy_score(y_test, svm_pred),5)
dt_accuracy = round(accuracy_score(y_test, dt_pred),5)
print(f"Accuracies:\n\
=================================\n\
 k-Nearest Neighbor:      {knn_accuracy}\n\
 Logistic Regression:     {log_reg_accuracy}\n\
 Support Vector Machines: {svm_accuracy}\n\
 Decision Tree:           {dt_accuracy}\n\
=================================\n")

Output:

Accuracies:
=================================
 k-Nearest Neighbor:      0.98148
 Logistic Regression:     0.98148
 Support Vector Machines: 0.98148
 Decision Tree:           0.96296
=================================

8.1. Evaluations using Classification Reports

from sklearn.metrics import classification_report

# create a dictionary  for model predictions
model_predictions = {
    "k-Nearest Neighbor":y_pred,
    "Logistic Regression": log_reg_pred,
    "Support Vector Machines": svm_pred,
    "Decision Trees": dt_pred
}

for model, pred in model_predictions.items():
    print(f"{model} \nResults:\n{classification_report(y_test, pred)}\
\n-----------------------------------------------------\n")

Output:

k-Nearest Neighbor 
Results:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.95      0.98        21
           2       1.00      1.00      1.00        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Logistic Regression 
Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.95      0.98        21
           2       0.93      1.00      0.97        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Support Vector Machines 
Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.95      1.00      0.98        21
           2       1.00      0.93      0.96        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

-----------------------------------------------------

Decision Trees 
Results:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97        19
           1       0.91      1.00      0.95        21
           2       1.00      0.93      0.96        14

    accuracy                           0.96        54
   macro avg       0.97      0.96      0.96        54
weighted avg       0.97      0.96      0.96        54

-----------------------------------------------------

Note: For more detailed information on each parameter, see the Classification Report page.

Conclusion
