aziz shaw

Reputation: 144

Finding KNN for a larger dataset

I am trying to find the nearest neighbors for a dataset A consisting of 25,000 rows. To do that, I am trying to fit a KNN model on a dataset B that consists of 13 million rows. The goal is to find 25,000 rows of dataset B that are similar to dataset A.

from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
model_knn.fit(B)

knn_distances, knn_indices = model_knn.kneighbors(A.values, n_neighbors=10)

When I fit B with up to 600,000 rows, there are no issues:

model_knn.fit(knn_test_pd[:600000])

Beyond 600,000 rows the model does not finish fitting. There is no error, but while fitting 600,000 rows takes 2 seconds, anything beyond that takes hours. The data I am fitting is scaled.

I tried splitting the data frame and fitting the pieces one by one. Is that a correct approach? Even then, the model takes hours to fit.

splited_B=np.array_split(B, 113)
model_knn= NearestNeighbors(n_neighbors=10, algorithm = 'kd_tree')
for df in splited_B:
    model_knn.fit(df)

What should I do to fit data this large with KNN? Or is there another model similar to KNN that can accept large datasets?

Upvotes: 0

Views: 2656

Answers (2)

Joseph S. Lubinda

Reputation: 11

Scikit-learn works well with smaller datasets, but it is limited by your main memory for loading and processing data. If you want to perform machine learning on larger datasets, I recommend using Apache Spark, which is designed to run on multiple nodes and so gets around the limits you are currently facing. spark-sklearn will help you get the job done.

Another option is to use random sampling. Take a smaller sample of the data to represent your full dataset and perform your operations on it. You can even divide the samples into smaller ones, run your algorithm on each, and compare the results. See this solution for random sampling.
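A minimal sketch of the random-sampling option, assuming B and A are numeric arrays (or data frames convertible with `np.asarray`) and that the sample fits in memory. The function name `knn_on_sample` and its parameters are hypothetical, chosen for illustration. The result is only approximate: any neighbor in B that is not drawn into the sample is missed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_on_sample(B, A, sample_size, n_neighbors=10, seed=0):
    """Search for neighbors of A's rows in a random subset of B's rows.

    Hypothetical helper: returns distances and row positions in the
    ORIGINAL B, not in the sample.
    """
    B = np.asarray(B, dtype=float)
    A = np.asarray(A, dtype=float)

    # Draw row positions without replacement so no row is sampled twice.
    rng = np.random.default_rng(seed)
    positions = np.sort(rng.choice(B.shape[0], size=sample_size, replace=False))

    # Fit only on the sampled rows.
    model = NearestNeighbors(n_neighbors=n_neighbors, algorithm="kd_tree")
    model.fit(B[positions])
    dist, idx = model.kneighbors(A)

    # Map sample-relative indices back to positions in the full B.
    return dist, positions[idx]
```

Running this with a few different `seed` values and comparing the returned neighbors gives a rough sense of how sensitive the result is to the sample.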

Upvotes: 1

Kasra

Reputation: 833

You can split dataset B into 600,000-row chunks, which gives you 22 datasets (and, correspondingly, 22 KNN models).

At prediction time, for each row in A, find the nearest data point in each of those 22 models; this gives you 22 candidate points. Finally, pick the nearest point among those 22 candidates.
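The chunked search described above can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the function name `chunked_kneighbors` and its parameters are hypothetical, B and A are assumed to be numeric arrays, and the sketch generalizes "nearest point per model" to the k nearest per chunk so that it matches `n_neighbors=10` from the question. Only one chunk's model is held at a time, and a running top-k is kept across chunks.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def chunked_kneighbors(B, A, n_neighbors=10, chunk_size=600_000):
    """k nearest rows of B for each row of A, searching B chunk by chunk."""
    B = np.asarray(B, dtype=float)
    A = np.asarray(A, dtype=float)
    n_queries = A.shape[0]

    # Running best candidates; start with "infinitely far" placeholders.
    best_dist = np.full((n_queries, n_neighbors), np.inf)
    best_idx = np.full((n_queries, n_neighbors), -1, dtype=np.int64)

    for start in range(0, B.shape[0], chunk_size):
        chunk = B[start:start + chunk_size]
        k = min(n_neighbors, chunk.shape[0])  # last chunk may be small

        # A fresh model per chunk; calling fit() again on one model
        # would discard the previously fitted chunk.
        model = NearestNeighbors(n_neighbors=k, algorithm="kd_tree").fit(chunk)
        dist, idx = model.kneighbors(A)
        idx = idx + start  # convert to row positions in the full B

        # Merge this chunk's candidates with the running best, keep top k.
        all_dist = np.hstack([best_dist, dist])
        all_idx = np.hstack([best_idx, idx])
        order = np.argsort(all_dist, axis=1)[:, :n_neighbors]
        best_dist = np.take_along_axis(all_dist, order, axis=1)
        best_idx = np.take_along_axis(all_idx, order, axis=1)

    return best_dist, best_idx
```

Because every row of B is searched in exactly one chunk and the k best candidates overall are kept, the merged result matches what a single model fitted on all of B would return (up to ties), while each individual fit stays at the chunk size that was reported to finish quickly.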

Upvotes: 2
