Reputation: 656
I have a dataset with over 20,000 rows.
I want to use columns A through E to predict column X using a k-nearest neighbors algorithm. I have tried to use KNeighborsRegressor from sklearn, as follows:
import pandas as pd
import random
from numpy.random import permutation
import math
from sklearn.neighbors import KNeighborsRegressor
df = pd.read_csv("data.csv")
random_indices = permutation(df.index)
test_cutoff = int(math.floor(len(df)/5))
test = df.loc[random_indices[:test_cutoff]]
train = df.loc[random_indices[test_cutoff:]]
x_columns = ['A', 'B', 'C', 'D', 'E']
y_column = ['X']
knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(train[x_columns], train[y_column])
predictions = knn.predict(test[x_columns])
This only makes predictions on the test data, which is a fifth of the original dataset. I also want predicted values for the training data.
To do this, I tried to implement my own k-nearest neighbors algorithm by calculating the Euclidean distance of each row from every other row, finding the k shortest distances, and averaging the X values from those k rows. This process took over 30 seconds for just one row, and I have over 20,000 rows. Is there a quicker way to do this?
Upvotes: 2
Views: 3525
Reputation: 1841
You do not need to split the data into train and test if you want predictions on the training data only.
You can just fit on the original data and then make predictions on it:
model.fit(original_data, target)
model.predict(original_data)
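For example, a minimal sketch using the column names from your question (assuming the same data.csv and the same KNeighborsRegressor settings):
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("data.csv")
x_columns = ['A', 'B', 'C', 'D', 'E']

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(df[x_columns], df['X'])           # fit on all rows
predictions = knn.predict(df[x_columns])  # predicted values for every row
Note that with weights='distance', predicting on the exact rows you trained on tends to just return each row's own X value, because the zero-distance self-neighbor dominates.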
Upvotes: 0
Reputation: 13733
Give this code a try:
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsRegressor
df = pd.read_csv("data.csv")
X = np.asarray(df.loc[:, ['A', 'B', 'C', 'D', 'E']])
y = np.asarray(df['X'])
rs = ShuffleSplit(n_splits=1, test_size=1./5, random_state=0)
train_indices, test_indices = next(rs.split(X))
knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(X[train_indices], y[train_indices])
predictions = knn.predict(X)
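If you also want the train and test predictions separately, you can slice them back out with the indices (a small addition, not part of the original code):
train_predictions = predictions[train_indices]
test_predictions = predictions[test_indices]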
The main difference with respect to your solution is the use of ShuffleSplit.
Notes:
- predictions contains the predicted values for all your data (test and train).
- test_size is the fraction of the data held out for testing (I used your setting, i.e. one fifth).
- next() yields the train and test indices from the iterator returned by rs.split(X).
Upvotes: 1
Reputation: 66805
To do this, I tried to implement my own k-nearest neighbors algorithm by calculating the Euclidean distance of each row from every other row, finding the k shortest distances, and averaging the X values from those k rows. This process took over 30 seconds for just one row, and I have over 20,000 rows. Is there a quicker way to do this?
Yes, the problem is that loops in Python are extremely slow. What you can do is vectorize your computations. So let's say that your data is in a matrix X (n x d); then the matrix of squared distances D_ij = ||X_i - X_j||^2 expands to
D_ij = ||X_i||^2 + ||X_j||^2 - 2 <X_i, X_j>
so in Python
D = (X ** 2).sum(1).reshape(-1, 1) + (X ** 2).sum(1).reshape(1, -1) - 2 * X.dot(X.T)  # all pairwise squared distances at once
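To complete the idea (a sketch of my own, not part of the original answer; the function name is illustrative): once D is computed, np.argpartition picks the k smallest distances per row, and averaging the corresponding targets gives predictions for every row at once:
import numpy as np

def knn_predict_all(X, y, k=100):
    # All pairwise squared distances via the identity above.
    sq = (X ** 2).sum(1)
    D = sq.reshape(-1, 1) + sq.reshape(1, -1) - 2 * X.dot(X.T)
    # Exclude each row from its own neighbor set.
    np.fill_diagonal(D, np.inf)
    # Column indices of the k smallest distances per row
    # (order within the k does not matter for averaging).
    nearest = np.argpartition(D, k, axis=1)[:, :k]
    # Average the target values of the k nearest rows.
    return y[nearest].mean(axis=1)
For 20,000 rows, D is a 20,000 x 20,000 array (roughly 3 GB in float64), so if memory is tight you can compute it in blocks of rows.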
Upvotes: 1