Reputation: 418
I am using kdtree from scikit-learn with a very large data set.
I can get kdtree to do the query in a somewhat reasonable time (20 minutes on my machine) but I can't map the indexes to the values they represent in any time less than 1 hour (I stop waiting after 1 hour).
I load 2 CSV files (train.csv has 29M records, test.csv has 8M records). I am interested in 3 columns: 'x' and 'y', which are floats, and 'place_id', which is a string.
from sklearn.neighbors import KDTree
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
tree = KDTree(train[['x','y']])
_, indexes = tree.query(test[['x','y']],k=30)
# takes 20 minutes to get here. Here is the code that takes more than an hour
result = [[train.iloc[idx].place_id for idx in idx_set] for idx_set in indexes]
Is there a faster way to do this? My goal here is to map all the indexes that get returned from KDTree to the place_ids.
Upvotes: 1
Views: 1307
Reputation: 2160
Maybe you can give this a try, since you don't need the distances from the query:
indexes = tree.query(test[['x','y']],k=30,return_distance=False,dualtree=True,sort_results=False)
This might reduce some computation time for the first part.
For the second part, I am thinking about flattening or reshaping indexes and slicing place_id instead of using nested loops. Can you provide the format of result? Is it just a simple list?
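A minimal sketch of that idea, assuming result can be a 2-D array of place_ids with the same shape as indexes. The tiny train frame and the hand-written indexes array here are made-up stand-ins for the real data and for the output of tree.query. The point is to pull the place_id column out as a NumPy array once and index it with the whole index array in one step, rather than calling train.iloc for every element:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv (hypothetical values, for illustration only).
train = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 2.0, 3.0],
    "place_id": ["a", "b", "c", "d"],
})

# Stand-in for the (n_queries, k) integer array returned by tree.query(...).
# The values are row positions in the data the tree was built from.
indexes = np.array([[0, 1], [3, 2], [1, 1]])

# Extract the column once, then use integer-array indexing to look up
# every neighbour's place_id in a single vectorized operation.
place_ids = train["place_id"].to_numpy()
result = place_ids[indexes]  # same shape as indexes
```

If a plain list of lists is really needed, result.tolist() should convert it, though keeping the NumPy array is likely cheaper at this data size.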
Upvotes: 1