Andrea Grianti

Reputation: 50

KNN K-Nearest Neighbors : train_test_split and knn.kneighbors

I have a general question for those with experience (which I lack) in using the function train_test_split from the sklearn library together with the KNN (K-Nearest Neighbors) algorithm, and in particular the knn.kneighbors function.

Assume you have samples in a very standard pandas DataFrame (with the default index) made of these columns: X1, X2, X3 (the features), Y1 (the target) and 'Name of person'.

So you have let's say 100 rows of generic data.

When you invoke the function train_test_split you pass, as parameters, the columns with the features (like df[['X1','X2','X3']]) and the column with the target (like df['Y1']), and in return you get the four variables X_train, X_test, y_train and y_test, split in a random way.
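A minimal sketch of that call (the data and column names here are just placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame in the shape described above: three features and a target
df = pd.DataFrame({"X1": range(100), "X2": range(100, 200),
                   "X3": range(200, 300), "Y1": [0, 1] * 50})

# Rows are shuffled and split (75% train / 25% test by default)
X_train, X_test, y_train, y_test = train_test_split(df[["X1", "X2", "X3"]],
                                                    df["Y1"],
                                                    random_state=0)
print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```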

ok. so far so good.

Assume that, after that, you use the KNN algorithm to make predictions on the test data. So you issue commands like:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

ok. fine. y_pred contains the predictions.

Now, here's the question: you want to see which points (taken from the X_train data) are the 'neighbors' that made the predictions possible.

For that there's a function called knn.kneighbors, which returns arrays with the distances and the coordinates (like arrays of X1, X2, X3) of the k points that were considered 'neighbors' of each data point in the X_test set.

neighbors=knn.kneighbors(X=X_test)

Question is: how do you link the data represented by the coordinates returned in the neighbors variable (an array of features) back to the original dataset, to understand to whom (=> column 'Name of person') these coordinates belong?

I mean: each row of the original dataset had a 'Name of person' associated with it, but you don't pass this to X_train or X_test. So can I re-link the array of neighbors returned by the knn.kneighbors function (which are now randomly mixed, with no reference to the original data) to the original dataset?

Is there an easy way to link back? In the end I would like to know the names of the neighbors of the points in the X_test data, and not just the anonymous array of coordinates that knn.kneighbors returns.

Otherwise I'm forced to loop over the neighbors' coordinates, looking each one up in the original dataset to find out to whom it belonged... but I hope I don't have to do that.
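The kind of brute-force matching I mean would look roughly like this (a sketch with made-up data and a hypothetical 'Name of person' column):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up data in the shape described above
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(100, 3), columns=["X1", "X2", "X3"])
df["Y1"] = rng.randint(0, 2, 100)
df["Name of person"] = ["person_%d" % i for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    df[["X1", "X2", "X3"]], df["Y1"], random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Brute-force re-linking by coordinates: for every neighbor of every
# test point, scan the original frame for the row with the same values
dist, ind = knn.kneighbors(X=X_test)
neighbor_names = []
for row in ind:                      # one row of neighbor indices per test point
    for i in row:                    # positional index into X_train
        coords = X_train.iloc[i]
        mask = ((df["X1"] == coords["X1"]) &
                (df["X2"] == coords["X2"]) &
                (df["X3"] == coords["X3"]))
        neighbor_names.append(df.loc[mask, "Name of person"].iloc[0])
```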

Thanks to all in advance. Andrea

Upvotes: 0

Views: 2712

Answers (1)

akilat90

Reputation: 5696

The output of the function knn.kneighbors(X=X_test) is more readable if you set return_distance=False. In that case, each row in the resulting array holds the indices of the n_neighbors nearest neighbors of the corresponding point (row) in X_test.

Note that these indices correspond to the indices in the training set X_train. If you want to map them back to the Name column in the original data frame, I think you must make use of the pandas index.

I hope the below example makes sense.


Creating the data set:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(42)  # for reproducibility
df = pd.DataFrame(np.random.randn(20, 3),
                  columns=["X1", "X2", "X3"])
df["Name"] = df.index.values * 100  # assume the names are just pandas index * 100
Y = np.random.randint(0, 2, 20)  # targets

print(df)

    X1           X2         X3          Name
0   0.496714    -0.138264   0.647689    0
1   1.523030    -0.234153   -0.234137   100
2   1.579213    0.767435    -0.469474   200
3   0.542560    -0.463418   -0.465730   300
4   0.241962    -1.913280   -1.724918   400
5   -0.562288   -1.012831   0.314247    500
6   -0.908024   -1.412304   1.465649    600
7   -0.225776   0.067528    -1.424748   700
8   -0.544383   0.110923    -1.150994   800
9   0.375698    -0.600639   -0.291694   900
10  -0.601707   1.852278    -0.013497   1000
11  -1.057711   0.822545    -1.220844   1100
12  0.208864    -1.959670   -1.328186   1200
13  0.196861    0.738467    0.171368    1300
14  -0.115648   -0.301104   -1.478522   1400
15  -0.719844   -0.460639   1.057122    1500
16  0.343618    -1.763040   0.324084    1600
17  -0.385082   -0.676922   0.611676    1700
18  1.031000    0.931280    -0.839218   1800
19  -0.309212   0.331263    0.975545    1900

Do the train test split:

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :3],
                                                    Y,
                                                    random_state=24  # for reproducibility
                                                   )

Notice the index of each data frame:

print(X_train)

          X1        X2        X3
8  -0.544383  0.110923 -1.150994
14 -0.115648 -0.301104 -1.478522
16  0.343618 -1.763040  0.324084
7  -0.225776  0.067528 -1.424748
10 -0.601707  1.852278 -0.013497
12  0.208864 -1.959670 -1.328186
19 -0.309212  0.331263  0.975545
18  1.031000  0.931280 -0.839218
15 -0.719844 -0.460639  1.057122
11 -1.057711  0.822545 -1.220844
4   0.241962 -1.913280 -1.724918
1   1.523030 -0.234153 -0.234137
0   0.496714 -0.138264  0.647689
3   0.542560 -0.463418 -0.465730
2   1.579213  0.767435 -0.469474

print(X_test)

          X1        X2        X3
13  0.196861  0.738467  0.171368
6  -0.908024 -1.412304  1.465649
17 -0.385082 -0.676922  0.611676
5  -0.562288 -1.012831  0.314247
9   0.375698 -0.600639 -0.291694

Since we have ensured reproducibility by setting random seeds, let's make a change that will help us understand the result of knn.kneighbors(X=X_test). I'm setting the first row of X_train to be the same as the last row of X_test. With those two points being identical, when we query for X_test.loc[[9]] (or X_test.iloc[4, :]) it should return itself as the closest point.

Notice that the first row (index 8) has been changed and is now equal to the last row of X_test:

X_train.loc[8]  = X_test.loc[9]
print(X_train)

          X1        X2        X3
8   0.375698 -0.600639 -0.291694
14 -0.115648 -0.301104 -1.478522
16  0.343618 -1.763040  0.324084
7  -0.225776  0.067528 -1.424748
10 -0.601707  1.852278 -0.013497
12  0.208864 -1.959670 -1.328186
19 -0.309212  0.331263  0.975545
18  1.031000  0.931280 -0.839218
15 -0.719844 -0.460639  1.057122
11 -1.057711  0.822545 -1.220844
4   0.241962 -1.913280 -1.724918
1   1.523030 -0.234153 -0.234137
0   0.496714 -0.138264  0.647689
3   0.542560 -0.463418 -0.465730
2   1.579213  0.767435 -0.469474

Training the KNN Model:

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

To make things simple, let's get the nearest neighbors of a single point (the same explanation applies to multiple points).

Obtaining the two nearest neighbors of the specific point X_test.loc[[9]] = [0.375698, -0.600639, -0.291694], which we used above to change X_train:

nn_indices = knn.kneighbors(X=X_test.loc[[9]], return_distance=False)
print(nn_indices)
[[ 0 13]]

which are:

print(X_train.iloc[np.squeeze(nn_indices)])

         X1        X2        X3
8  0.375698 -0.600639 -0.291694  <- same point in X_train
3  0.542560 -0.463418 -0.465730  <- second-closest point in X_train

This means that rows 0 and 13 (positional) of X_train are the closest to the point [0.375698, -0.600639, -0.291694].

In order to map them to names, you can use:

print(df["Name"][np.squeeze(X_train.index.values[nn_indices])])

8    800
3    300
Name: Name, dtype: int64

If you don't set return_distance=False, you will notice that the first distance value is zero (the distance from a point to itself is zero):

nn_distances, nn_indices = knn.kneighbors(X=X_test.loc[[9]])
print(nn_distances)

[[0.         0.27741858]] 

You can also use the n_neighbors argument of kneighbors to retrieve more nearest neighbors. By default it uses the value that was set when fitting the model.
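For example, rebuilding the same toy setup, the model fitted with n_neighbors=2 can still be queried for three neighbors (the row-8 overwrite from above is skipped here, since it doesn't matter for this point):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same toy setup as above
np.random.seed(42)
df = pd.DataFrame(np.random.randn(20, 3), columns=["X1", "X2", "X3"])
df["Name"] = df.index.values * 100
Y = np.random.randint(0, 2, 20)

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :3], Y,
                                                    random_state=24)

knn = KNeighborsClassifier(n_neighbors=2)  # fitted with two neighbors
knn.fit(X_train, y_train)

# Override n_neighbors at query time to get three neighbors instead
nn_indices = knn.kneighbors(X=X_test.iloc[[0]], n_neighbors=3,
                            return_distance=False)
print(nn_indices.shape)  # (1, 3): one query point, three neighbor indices
```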

Edit:

For the entire X_test, you could do:

nn_indices = knn.kneighbors(X=X_test, return_distance=False)
pd.DataFrame(nn_indices).applymap(lambda x: df["Name"][X_train.index.values[x]])

    0       1
0   1900    0
1   1500    1600
2   1500    0
3   1500    1600
4   800     300
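An equivalent, loop-free way to build that table is to translate the positional indices into the original row labels first, and then look all the names up in one vectorized step; reindex keeps it correct even if df doesn't have the default 0..n-1 index (the setup is rebuilt below, with the row-8 overwrite skipped, since it only affects which neighbors come back, not the mechanics):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same toy setup as above
np.random.seed(42)
df = pd.DataFrame(np.random.randn(20, 3), columns=["X1", "X2", "X3"])
df["Name"] = df.index.values * 100
Y = np.random.randint(0, 2, 20)

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :3], Y,
                                                    random_state=24)
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

nn_indices = knn.kneighbors(X=X_test, return_distance=False)

# positional indices -> original row labels, same (n_test, n_neighbors) shape
labels = X_train.index.values[nn_indices]
# labels -> names in one vectorized lookup
names = df["Name"].reindex(labels.ravel()).to_numpy().reshape(labels.shape)
result = pd.DataFrame(names, index=X_test.index)
print(result)
```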

Upvotes: 1
