Magnus Vivadeepa

Reputation: 57

LeaveOneOut to determine k of knn

I want to find the best k for k-nearest-neighbors (KNN). I am using LeaveOneOut to divide my data into train and test sets. In the code below I have 150 data entries, so I get 150 different train and test sets. k should be between 1 and 40.

I want to plot the cross-validation average classification error as a function of k, to see which k is best for KNN.

Here is my code:

import scipy.io as sio
import seaborn as sn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut    
error = []
array = np.array(range(1,41))

dataset = pd.read_excel('Data/iris.xls')
X = dataset.iloc[:, :-1].values  
y = dataset.iloc[:, 4].values

loo = LeaveOneOut()
loo.get_n_splits(X)
for train_index, test_index in loo.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

    for i in range(1, 41):  
        classifier = KNeighborsClassifier(n_neighbors=i)  
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        error.append(np.mean(y_pred != y_test))

plt.figure(figsize=(12, 6))  
plt.plot(range(1, 41), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')  
plt.xlabel('K Value')  
plt.ylabel('Mean Error')

Upvotes: 0

Views: 2317

Answers (1)

Vivek Kumar

Reputation: 36619

You are calculating the error at each individual prediction; that's why you have 6000 points in your error array (150 splits × 40 values of n_neighbors). You need to collect the predictions of all points in the folds for a given n_neighbors and then calculate the error for that value.

You can do this:

# Loop over possible values of "n_neighbors"
for i in range(1, 41):  

    # Collect the actual and predicted values for all splits for a single "n_neighbors"
    actual = []
    predicted = []

    for train_index, test_index in loo.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        classifier = KNeighborsClassifier(n_neighbors=i)  
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)

        # Append the single prediction and actual value for this split.
        actual.append(y_test[0])
        predicted.append(y_pred[0])

    # Outside the loop, calculate the error.
    error.append(np.mean(np.array(predicted) != np.array(actual))) 

The rest of your code is okay.
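One small note on the plotting code, in case the figure does not appear when you run this as a plain script (this is an addition on my part, not something missing from your logic):

plt.show()  # displays the figure when running outside an interactive session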

There is a more compact way to do this if you use cross_val_predict:

from sklearn.model_selection import cross_val_predict

error = []  # reset, so results don't mix with the earlier loop's
for i in range(1, 41):
    classifier = KNeighborsClassifier(n_neighbors=i)
    # One leave-one-out prediction per sample, scored in a single pass
    y_pred = cross_val_predict(classifier, X, y, cv=loo)
    error.append(np.mean(y_pred != y))
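Either way, once error is filled you can read off the winning k directly. A minimal sketch, assuming error[i] holds the LOO error rate for n_neighbors = i + 1 as in the loops above:

best_k = int(np.argmin(error)) + 1  # +1 because range(1, 41) starts at k=1
print("Best k:", best_k, "mean LOO error:", error[best_k - 1])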

Upvotes: 1
