In Python, Need an Efficient way to map kdtree indexes to the values

Question

I am using KDTree from scikit-learn with a very large data set.

I can get KDTree to run the query in a somewhat reasonable time (about 20 minutes on my machine), but mapping the returned indexes to the values they represent takes more than 1 hour (I stop waiting after 1 hour).

I load two CSV files (train.csv has 29M records, test.csv has 8M records). I am interested in 3 columns: 'x' and 'y', which are floats, and 'place_id', which is a string.

from sklearn.neighbors import KDTree
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

tree = KDTree(train[['x','y']])
_, indexes = tree.query(test[['x','y']],k=30)

# takes 20 minutes to get here.  Here is the code that takes more than an hour

result = [[train.iloc[idx].place_id for idx in idx_set] for idx_set in indexes]

Is there a faster way to do this? My goal here is to map all the indexes that get returned from KDTree to the place_ids.
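
One possible approach, sketched here under assumptions rather than measured on the real data: the list comprehension calls train.iloc once per index, which is roughly 8M x 30 separate pandas lookups. NumPy fancy indexing can do the same mapping in a single vectorized step, as long as the indexes are positional (which is what train.iloc also assumes). The tiny DataFrame and the hard-coded indexes array below are hypothetical stand-ins for train.csv and for the output of tree.query.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv (the real file has ~29M rows).
train = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 2.0, 3.0],
    "place_id": ["a", "b", "c", "d"],
})

# Hypothetical stand-in for the (n_queries, k) integer array that
# tree.query(test[['x', 'y']], k=...) returns as its second value.
indexes = np.array([[0, 1],
                    [3, 2]])

# Pull the column out of pandas once, as a plain NumPy array.
place_ids = train["place_id"].to_numpy()

# Index it with the whole 2-D index array in one operation.
# The result has the same shape as indexes, holding place_id values.
result = place_ids[indexes]

print(result)
```

Here result is a 2-D NumPy array rather than a list of lists; result.tolist() converts it if the nested-list form is needed. With 8M queries and k=30 the result holds 240M entries, so memory use is worth checking before running this on the full data.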

