Reputation: 127
I'm currently teaching myself Python for data science and stumbled upon a chapter that I have been looking at for hours, but I don't understand it. I hope you can help me understand it. In the example they want to code k-nearest neighbors. The code looks like this:
import numpy as np
import matplotlib.pyplot as plt

X = np.random.rand(10, 2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
nearest = np.argsort(dist_sq, axis=1)
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        plt.plot(*zip(X[j], X[i]), color='black')
I do understand the premise of calculating the Euclidean distance, but the 3D array part is very abstract for me. Thank you for dumbing it down for me. I appreciate every answer! Thanks!
FYI: the book I'm learning from is the Python Data Science Handbook, page 89.
Upvotes: 2
Views: 306
Reputation: 2174
Array broadcasting in 3 dimensions is pretty tricky to wrap your head around.
Let's start with 2 dimensions:
X = np.arange(5)
X[np.newaxis, :] + 10 * X[:, np.newaxis]
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])
As you can see, when we add a (1 x N) row vector to an (N x 1) column vector, we get an (N x N) matrix. Right before adding, the row vector becomes an (N x N) matrix where every row is the same. Similarly, the column vector becomes an (N x N) matrix where every column is the same. In some sense, this is a shorthand way of doing the following operation.
X1 = np.array([[0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.]])
X2 = np.array([[ 0.,  0.,  0.,  0.,  0.],
               [10., 10., 10., 10., 10.],
               [20., 20., 20., 20., 20.],
               [30., 30., 30., 30., 30.],
               [40., 40., 40., 40., 40.]])
Clearly, X1 + X2 will get us the same answer we got before.
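If you want NumPy to materialize those stretched arrays for you, np.broadcast_to does exactly that. Here is a quick sanity check (mine, not from the book) that it reproduces the X1 and X2 above:
row = X[np.newaxis, :]                                   # shape (1, 5)
col = 10 * X[:, np.newaxis]                              # shape (5, 1)
print(np.array_equal(np.broadcast_to(row, (5, 5)), X1))  # True
print(np.array_equal(np.broadcast_to(col, (5, 5)), X2))  # True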
So how does this work in 3 dimensions? Very much the same way. Before, we repeated the 1st dimension across the 2nd dimension (and vice versa).
X1 = X[:, np.newaxis, :]
X2 = X[np.newaxis, :, :]
difference = X1 - X2
Right before subtracting, X1's 1st and 3rd dimensions are repeated for every slice in the 2nd dimension. X2's 2nd and 3rd dimensions are repeated for every slice in the 1st dimension. Let's observe with some easier-to-read matrices.
X1 = np.array([[1., 2.],
               [3., 4.],
               [5., 6.]])
X2 = np.array([[10., 20.],
               [30., 40.],
               [50., 60.]])
X1[:, np.newaxis, :] + X2[np.newaxis, :, :]
array([[[11., 22.],
        [31., 42.],
        [51., 62.]],

       [[13., 24.],
        [33., 44.],
        [53., 64.]],

       [[15., 26.],
        [35., 46.],
        [55., 66.]]])
It's easiest visually to see X2 repeated across the 1st dimension (the blocks): each block contains a full copy of X2's rows (the tens digits), while the ones digit, which comes from a single row of X1, stays the same throughout a block. Perhaps it's easier to read this as a for loop of 2D broadcasting:
first_dimension = []
for i_row in range(X1.shape[0]):
    # 2D broadcasting: add row i_row of X1 to all of X2 at once
    first_dimension.append(X2 + X1[i_row, :])
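Stacking those 2D slices reproduces the broadcast result exactly. A quick check of my own, using the X1 and X2 from the example above:
looped = np.array(first_dimension)                       # shape (3, 3, 2)
broadcast = X1[:, np.newaxis, :] + X2[np.newaxis, :, :]  # same thing in one step
print(np.allclose(looped, broadcast))                    # True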
Hopefully it's clear now that
X1 = X[:, np.newaxis, :]
X2 = X[np.newaxis, :, :]
difference = X1 - X2
sq_diff = difference ** 2
sq_diff
is a 3D tensor, where each slice along the 3rd dimension, sq_diff[:, :, k], holds the pairwise squared differences of a single coordinate (column k) of X.
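With the original 10x2 data from the question you can follow the shapes at every step; a small sketch of my own:
X = np.random.rand(10, 2)
X1 = X[:, np.newaxis, :]    # shape (10, 1, 2)
X2 = X[np.newaxis, :, :]    # shape (1, 10, 2)
sq_diff = (X1 - X2) ** 2    # broadcasts to shape (10, 10, 2)
print(X1.shape, X2.shape, sq_diff.shape)
# sq_diff[i, j, k] == (X[i, k] - X[j, k]) ** 2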
ssq_diff = np.sum(sq_diff, axis=-1)
then just sums across the 3rd dimension (axis=-1 just means the last dimension in the array). Now ssq_diff is a 2D matrix, where each element is the squared Euclidean distance between two of the data points: for row i and column j, ssq_diff[i, j] is the squared Euclidean distance between the i-th and j-th rows of X. (Take np.sqrt if you need the actual distance; for finding nearest neighbors, the squared distance sorts the same way.)
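As a sanity check (my own, not from the book), the broadcast result matches an explicit double loop over all point pairs:
manual = np.empty((10, 10))
for i in range(10):
    for j in range(10):
        # squared Euclidean distance between points i and j
        manual[i, j] = np.sum((X[i] - X[j]) ** 2)

print(np.allclose(ssq_diff, manual))        # True
print(np.allclose(np.diag(ssq_diff), 0.0))  # each point is 0 away from itself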
Upvotes: 1