Larry Freeman

Reputation: 418

In Python, need an efficient way to map KDTree indexes to their values

I am using kdtree from scikit-learn with a very large data set.

I can get KDTree to run the query in a somewhat reasonable time (20 minutes on my machine), but mapping the returned indexes to the values they represent takes more than 1 hour (I stop waiting after that).

I load 2 CSV files (train.csv has 29M records, test.csv has 8M records). I am interested in 3 columns: 'x' and 'y', which are floats, and 'place_id', which is a string.

from sklearn.neighbors import KDTree
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

tree = KDTree(train[['x','y']])
_, indexes = tree.query(test[['x','y']],k=30)

# takes 20 minutes to get here.  Here is the code that takes more than an hour

result = [[train.iloc[idx].place_id for idx in idx_set] for idx_set in indexes]

Is there a faster way to do this? My goal here is to map all the indexes that get returned from KDTree to the place_ids.

Upvotes: 1

Views: 1307

Answers (1)

Andreas Hsieh

Reputation: 2160

Since you don't need the distances from the query, you could try this:

indexes = tree.query(test[['x','y']],k=30,return_distance=False,dualtree=True,sort_results=False)

This might reduce the computation time for the first part. For the second part, I am thinking about flattening or reshaping indexes and using it to slice place_id directly instead of running the nested loops. Can you describe the format you need for result? Is it just a simple list?
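To make the second suggestion concrete, here is a minimal sketch of replacing the nested loops with a single NumPy indexing step. It assumes the indexes returned by tree.query are row positions into train (which holds when the tree is built from train[['x','y']] as in the question). The toy train frame and the hand-written indexes array below are hypothetical stand-ins so the snippet runs without the real CSV files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (the real frame has 29M rows).
train = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 2.0, 3.0],
    "place_id": ["a", "b", "c", "d"],
})

# Hypothetical stand-in for the (n_queries, k) integer array that
# tree.query(..., return_distance=False) returns.
indexes = np.array([[0, 1], [3, 2], [1, 1]])

# Extract the column once as a NumPy array, then index it with the whole
# 2-D array of row positions in one vectorized operation.
place_ids = train["place_id"].to_numpy()
result = place_ids[indexes]  # same shape as indexes

# Sanity check against the original nested-loop approach.
looped = [[train.iloc[idx].place_id for idx in idx_set] for idx_set in indexes]
assert result.tolist() == looped
```

Here result is a 2-D array with one row per test point and one column per neighbor. Calling result.tolist() gives the nested-list form from the question, although at 8M rows by 30 neighbors it is probably better to keep it as an array.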

Upvotes: 1
