Imran

Reputation: 656

Using k-nearest neighbour without splitting into training and test sets

I have the following dataset, with over 20,000 rows:

[screenshot of the dataset: numeric columns A through E and a target column X]

I want to use columns A through E to predict column X using a k-nearest neighbor algorithm. I have tried to use KNeighborsRegressor from sklearn, as follows:

import pandas as pd
import random
from numpy.random import permutation
import math
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("data.csv")

# shuffle the row indices and hold out one fifth of the rows as a test set
random_indices = permutation(df.index)
test_cutoff = int(math.floor(len(df) / 5))
test = df.loc[random_indices[:test_cutoff]]
train = df.loc[random_indices[test_cutoff:]]

x_columns = ['A', 'B', 'C', 'D', 'E']
y_column = ['X']

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(train[x_columns], train[y_column])
predictions = knn.predict(test[x_columns])

This only makes predictions on the test data, which is one fifth of the original dataset. I also want prediction values for the training data.

To get those, I tried to implement my own k-nearest neighbour algorithm: for each row, compute the Euclidean distance to every other row, find the k shortest distances, and average the X values of those k rows. This took over 30 seconds for a single row, and I have over 20,000 rows. Is there a quicker way to do this?

Upvotes: 2

Views: 3525

Answers (3)

nitinvijay23

Reputation: 1841

You do not need to split the data into train and test if you want predictions on training data only.

You can simply fit the model on the original data and then predict on that same data:

# fit on the full dataset and predict on the same rows
knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(df[x_columns], df[y_column])
predictions = knn.predict(df[x_columns])
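One caveat worth noting: with weights='distance', a training point sits at zero distance from itself, so predictions on the training rows will essentially just reproduce the training targets. Predictions on held-out rows are the more informative check of the model.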

Upvotes: 0

Tonechas

Reputation: 13733

Give this code a try:

import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("data.csv")
X = np.asarray(df.loc[:, ['A', 'B', 'C', 'D', 'E']])
y = np.asarray(df['X'])

rs = ShuffleSplit(n_splits=1, test_size=1./5, random_state=0)
train_indices, test_indices = next(rs.split(X))

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(X[train_indices], y[train_indices])

predictions = knn.predict(X)

The main difference with respect to your solution is the use of ShuffleSplit.

Notes:

  • predictions contains the predicted values for all your data (test and train); see the sketch after these notes for slicing it back into the two parts.
  • The proportion of test data can be adjusted through the parameter test_size (I used your setting, i.e. one fifth).
  • It is necessary to call next() on the generator returned by split to make it yield the indices of the train and test data.
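If you also want separate error figures for the two parts, a minimal sketch (the use of mean_squared_error here is my addition, not part of the answer above):

from sklearn.metrics import mean_squared_error

# predictions covers every row, so slice it with the saved index arrays
train_mse = mean_squared_error(y[train_indices], predictions[train_indices])
test_mse = mean_squared_error(y[test_indices], predictions[test_indices])
print("train MSE: %.4f, test MSE: %.4f" % (train_mse, test_mse))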

Upvotes: 1

lejlot

Reputation: 66805

To get those, I tried to implement my own k-nearest neighbour algorithm: for each row, compute the Euclidean distance to every other row, find the k shortest distances, and average the X values of those k rows. This took over 30 seconds for a single row, and I have over 20,000 rows. Is there a quicker way to do this?

Yes, the problem is that loops in Python are extremely slow, so what you can do is vectorize your computations. Say your data is a matrix X of shape n × d, with one sample per row; then the matrix D of squared Euclidean distances, D_ij = ||X_i - X_j||^2, expands to

D_ij = ||X_i||^2 + ||X_j||^2 - 2 X_i · X_j

so in Python

# D[i, j] = ||X[i] - X[j]||^2 for all pairs at once, via broadcasting
D = (X ** 2).sum(1).reshape(-1, 1) + (X ** 2).sum(1).reshape(1, -1) - 2 * X.dot(X.T)
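To turn that distance matrix into actual predictions, here is a rough sketch of the full vectorized nearest-neighbour averaging; the function name and the use of np.argpartition are my own, not from the answer:

import numpy as np

def knn_predict_all(X, y, k):
    # squared distances between every pair of rows
    sq = (X ** 2).sum(1)
    D = sq.reshape(-1, 1) + sq.reshape(1, -1) - 2 * X.dot(X.T)
    # a row should not count as its own neighbour
    np.fill_diagonal(D, np.inf)
    # indices of the k smallest distances in each row (unordered)
    nearest = np.argpartition(D, k, axis=1)[:, :k]
    # average the target values of the k nearest rows
    return y[nearest].mean(axis=1)

predictions = knn_predict_all(X, y, k=100)

Bear in mind that the full n × n distance matrix for 20,000 rows takes roughly 3 GB as float64, so you may need to process the rows in chunks.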

Upvotes: 1
