breeden

Reputation: 735

Increasing performance of nearest neighbors of rows in Pandas

I am given 8000x3 data set similar to this one:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(8000,3), columns=list('XYZ'))

So for a visual reference, df.head(5) looks like this:

          X         Y         Z
0  0.462433  0.559442  0.016778
1  0.663771  0.092044  0.636519
2  0.111489  0.676621  0.839845
3  0.244361  0.599264  0.505175
4  0.115844  0.888622  0.766014

I'm trying to implement a method that, given an index into the dataset, returns similar items from the dataset (in some reasonable way). For now I have:

def find_similiar_items(item_id):
    # Difference between every row and the chosen row
    tmp_df = df.sub(df.loc[item_id], axis='columns')
    # Squared euclidean distance of each row to the chosen row
    tmp_series = tmp_df.apply(np.square).apply(np.sum, axis=1)
    return tmp_series.sort_values()

This method takes the given row, subtracts it from every row in the dataframe, and then sums the squares of the differences, i.e. the squared euclidean distance of each row to the given row. So the method simply returns a series of the points in the dataset ordered from nearest to farthest from your given point, by euclidean distance.

So you can get the nearest 5 points, for instance, with:

df.loc[find_similiar_items(5).index].head(5)

which yields:

             X         Y         Z
5     0.364020  0.380303  0.623393
4618  0.369122  0.399772  0.643603
4634  0.352484  0.402435  0.619763
5396  0.386675  0.370417  0.600555
3229  0.355186  0.410202  0.616844

The problem with this method is that it takes roughly half a second each time I call it. This isn't acceptable for my purpose, so I need to figure out how to improve the performance of this method in some way. So I have a few questions:

Question 1: Is there perhaps a more efficient way of calculating the euclidean distance as above?

Question 2: Is there some other technique that would yield reasonable results like this (the euclidean distance isn't important, for instance)? Computation time is more important than memory in this problem, and pre-processing time is not important; so I would be willing, for instance, to construct a new dataframe that has the size of the Cartesian product (n^2) of the original dataframe (but anything much larger than that would probably become unreasonable).
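To make Question 2 concrete, here is a sketch of the kind of pre-processing I have in mind (just an illustration; find_similar_precomputed is a made-up name and it assumes scipy is available). The full pairwise distance matrix is built once, and each lookup is then only an argsort over one row of it:

from scipy.spatial.distance import cdist

# Precompute every pairwise euclidean distance once (an 8000x8000 matrix).
dist_matrix = cdist(df.values, df.values)

def find_similar_precomputed(item_id, k=5):
    # Indices of the k rows closest to item_id, nearest first.
    order = np.argsort(dist_matrix[item_id])[:k]
    return df.iloc[order]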

Upvotes: 2

Views: 4241

Answers (1)

JohnE

Reputation: 30444

Your biggest (and easiest) performance gain is likely to be from merely doing this in numpy rather than pandas. I'm seeing over a 200x improvement just from a quick conversion of the code to numpy:

arr = df.values

def fsi_numpy(item_id):
    # Squared euclidean distance from row item_id to every row, computed in numpy
    tmp_arr = arr - arr[item_id]
    tmp_ser = np.sum( np.square( tmp_arr ), axis=1 )
    return tmp_ser

df['dist'] = fsi_numpy(5)
df = df.sort_values('dist').head(5)

             X         Y         Z      dist
5     0.272985  0.131939  0.449750  0.000000
5130  0.272429  0.138705  0.425510  0.000634
4609  0.264882  0.103006  0.476723  0.001630
1794  0.245371  0.175648  0.451705  0.002677
6937  0.221363  0.137457  0.463451  0.002883

Check that it gives the same result as your function (the numbers differ from yours because my random draw is different):

df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)

             X         Y         Z
5     0.272985  0.131939  0.449750
5130  0.272429  0.138705  0.425510
4609  0.264882  0.103006  0.476723
1794  0.245371  0.175648  0.451705
6937  0.221363  0.137457  0.463451

Timings:

%timeit df.loc[ pd.DataFrame( find_similiar_items(5)).index].head(5)
1 loops, best of 3: 638 ms per loop

In [105]: %%timeit
     ...: df['dist'] = fsi_numpy(5)
     ...: df = df.sort_values('dist').head(5)
     ...: 
100 loops, best of 3: 2.69 ms per loop
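As a further tweak (not part of the timing above, and fsi_topk is just an illustrative name): if only the k nearest rows are ever needed, np.argpartition selects the k smallest distances without sorting all 8000 of them, which shaves a bit more off each call.

def fsi_topk(item_id, k=5):
    # Squared euclidean distance from row item_id to every row.
    dist = np.sum(np.square(arr - arr[item_id]), axis=1)
    # Pick the k smallest distances without a full sort, then order just those k.
    idx = np.argpartition(dist, k)[:k]
    return idx[np.argsort(dist[idx])]

Indexing the original dataframe with these indices (df.iloc[fsi_topk(5)]) gives the same five rows as before.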

Upvotes: 4
