Reputation: 50
I have a general question for those with experience (which I lack) in using the train_test_split function from the sklearn library together with the KNN (K-Nearest Neighbors) algorithm, and in particular the knn.kneighbors function.
Assume you have samples in a standard pandas DataFrame (with the default index) made of these columns: the features X1, X2, X3, the target Y1, and a column 'Name of person'.
So you have, let's say, 100 rows of generic data.
When you invoke the function train_test_split you pass, as parameters, the feature columns (like df[['X1','X2','X3']]) and the target column (like df['Y1']), and in return you get the four variables X_train, X_test, y_train, y_test, split in a random way.
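So the call looks something like this:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['X1', 'X2', 'X3']], df['Y1'])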
OK, so far so good.
Assume that, after that, you use the KNN algorithm to make predictions on the test data. So you issue commands like:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
OK, fine: y_pred contains the predictions.
Now, here's the question: you want to see who the 'neighbors' are that made these predictions possible.
For that there's a method called knn.kneighbors, which returns an array of the distances and the coordinates (like an array of X1, X2, X3) of the k points that were considered 'neighbors' of each data point in the X_test set:
neighbors = knn.kneighbors(X=X_test)
Question is: how do you link the data represented by the coordinates returned in the neighbors variable, as an array of features, with the original dataset, so as to understand who (=> column 'Name of person') these coordinates belong to?
I mean: for each row of the original dataset there was a 'Name of person' associated. You don't pass this to X_train or X_test. So can I re-link the array of neighbors returned by the knn.kneighbors function (which is now randomly shuffled, with no reference to the original data) to the original dataset?
Is there an easy way to link back? In the end I would like to know the names of the neighbors of the points in X_test, and not just the anonymous array of coordinates that knn.kneighbors returns.
Otherwise I'm forced to look up the neighbors' coordinates in a loop over the original dataset to find out whom they belonged to... but I hope I don't have to.
Thanks to all in advance. Andrea
Upvotes: 0
Views: 2712
Reputation: 5696
The output of knn.kneighbors(X=X_test) is more readable if you set return_distance=False. In that case, each row in the resulting array holds the indices of the n_neighbors nearest neighbors of the corresponding point (row) in X_test.
Note that these indices correspond to the indices in the training set X_train. If you want to map them back to the Name column in the original data frame, I think you must make use of the pandas index. I hope the example below makes sense.
Creating the data set:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(42)  # for reproducibility
df = pd.DataFrame(np.random.randn(20, 3),
                  columns=["X1", "X2", "X3"])
df["Name"] = df.index.values * 100  # assume the names are just the pandas index * 100
Y = np.random.randint(0, 2, 20)  # targets
print(df)
X1 X2 X3 Name
0 0.496714 -0.138264 0.647689 0
1 1.523030 -0.234153 -0.234137 100
2 1.579213 0.767435 -0.469474 200
3 0.542560 -0.463418 -0.465730 300
4 0.241962 -1.913280 -1.724918 400
5 -0.562288 -1.012831 0.314247 500
6 -0.908024 -1.412304 1.465649 600
7 -0.225776 0.067528 -1.424748 700
8 -0.544383 0.110923 -1.150994 800
9 0.375698 -0.600639 -0.291694 900
10 -0.601707 1.852278 -0.013497 1000
11 -1.057711 0.822545 -1.220844 1100
12 0.208864 -1.959670 -1.328186 1200
13 0.196861 0.738467 0.171368 1300
14 -0.115648 -0.301104 -1.478522 1400
15 -0.719844 -0.460639 1.057122 1500
16 0.343618 -1.763040 0.324084 1600
17 -0.385082 -0.676922 0.611676 1700
18 1.031000 0.931280 -0.839218 1800
19 -0.309212 0.331263 0.975545 1900
Do the train/test split:
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :3], Y,
    random_state=24,  # for reproducibility
)
Notice the index of each data frame:
print(X_train)
X1 X2 X3
8 0.375698 -0.600639 -0.291694
14 -0.115648 -0.301104 -1.478522
16 0.343618 -1.763040 0.324084
7 -0.225776 0.067528 -1.424748
10 -0.601707 1.852278 -0.013497
12 0.208864 -1.959670 -1.328186
19 -0.309212 0.331263 0.975545
18 1.031000 0.931280 -0.839218
15 -0.719844 -0.460639 1.057122
11 -1.057711 0.822545 -1.220844
4 0.241962 -1.913280 -1.724918
1 1.523030 -0.234153 -0.234137
0 0.496714 -0.138264 0.647689
3 0.542560 -0.463418 -0.465730
2 1.579213 0.767435 -0.469474
print(X_test)
X1 X2 X3
13 0.196861 0.738467 0.171368
6 -0.908024 -1.412304 1.465649
17 -0.385082 -0.676922 0.611676
5 -0.562288 -1.012831 0.314247
9 0.375698 -0.600639 -0.291694
Since we have ensured reproducibility by setting random seeds, let's make a change that will help us understand the result of knn.kneighbors(X=X_test). I'm setting the first row of X_train to be the same as the last row of X_test. With those two points being identical, when we query for X_test.loc[[9]] (or X_test.iloc[4, :]) it should return itself as the closest point.
Notice that the first row, with index 8, has been changed and is now equal to the last row of X_test:
X_train.loc[8] = X_test.loc[9]
print(X_train)
X1 X2 X3
8 0.375698 -0.600639 -0.291694
14 -0.115648 -0.301104 -1.478522
16 0.343618 -1.763040 0.324084
7 -0.225776 0.067528 -1.424748
10 -0.601707 1.852278 -0.013497
12 0.208864 -1.959670 -1.328186
19 -0.309212 0.331263 0.975545
18 1.031000 0.931280 -0.839218
15 -0.719844 -0.460639 1.057122
11 -1.057711 0.822545 -1.220844
4 0.241962 -1.913280 -1.724918
1 1.523030 -0.234153 -0.234137
0 0.496714 -0.138264 0.647689
3 0.542560 -0.463418 -0.465730
2 1.579213 0.767435 -0.469474
Training the KNN Model:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
To keep things simple, let's get the nearest neighbors of a single point (the same explanation applies to multiple points).
Obtaining the two nearest neighbors for the specific point X_test.loc[[9]] = [0.375698, -0.600639, -0.291694] (which we used above to modify X_train):
nn_indices = knn.kneighbors(X=X_test.loc[[9]], return_distance=False)
print(nn_indices)
[[ 0 13]]
which are:
print(X_train.iloc[np.squeeze(nn_indices)])
X1 X2 X3
8   0.375698 -0.600639 -0.291694   <- same point in X_train
3   0.542560 -0.463418 -0.465730   <- second-closest point in X_train
This means that the rows at positions 0 and 13 in X_train are the closest to the point [0.375698, -0.600639, -0.291694].
In order to map them to names, you can use:
print(df["Name"][np.squeeze(X_train.index.values[nn_indices])])
8 800
3 300
Name: Name, dtype: int64
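A slightly more direct variant of the same lookup, going through .loc (just a sketch, same result):
print(df.loc[X_train.index[np.squeeze(nn_indices)], "Name"])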
If you haven't set return_distance=False, you will notice that the first distance value is zero (the distance from a point to itself is zero):
nn_distances, nn_indices = knn.kneighbors(X=X_test.loc[[9]])
print(nn_distances)
[[0. 0.27741858]]
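As a sanity check, the second value is just the Euclidean distance (the classifier's default metric) between the query point and the second-closest training point:
print(np.linalg.norm(X_test.loc[9].values - X_train.loc[3].values))  # ~0.27741858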
You can also make use of the n_neighbors argument of kneighbors to get more nearest neighbors out. By default it uses the value that was set while fitting the model.
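For instance, a quick sketch that queries three neighbors for the same point:
nn_indices = knn.kneighbors(X=X_test.loc[[9]], n_neighbors=3, return_distance=False)
print(nn_indices)  # each row now holds the positions of the 3 nearest training points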
Edit:
For the entire X_test, you could do:
nn_indices = knn.kneighbors(X=X_test, return_distance=False)
pd.DataFrame(nn_indices).applymap(lambda x: df["Name"][X_train.index.values[x]])
0 1
0 1900 0
1 1500 1600
2 1500 0
3 1500 1600
4 800 300
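A small caveat: on pandas 2.1+ DataFrame.applymap is deprecated in favor of the element-wise DataFrame.map, so on recent versions the same mapping would be written as:
pd.DataFrame(nn_indices).map(lambda x: df["Name"][X_train.index.values[x]])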
Upvotes: 1