Reputation: 31
So, the situation is:
I have two numpy 2D arrays / pandas DataFrames (it doesn't matter which I use). Each of them contains approximately 10^6 records. Each record is a row of 10 float numbers.
I need to replace each row in the second array (dataframe) with the row from the first table that has the smallest MSE compared to it. I can easily do it with "for" loops, but that sounds horrifyingly slow. Is there a nice and beautiful numpy/pandas solution I'm not seeing?
P.S. For example:
arr1: [[1,2,3],[4,5,6],[7,8,9]]
arr2: [[9,10,11],[3,2,1],[5,5,5]]
result should be: [[7,8,9],[1,2,3],[4,5,6]]
In this example there are 3 numbers in each record and 3 records total. I have 10 numbers in each record and around 1,000,000 records total.
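For reference, the brute-force version described above can be written without explicit loops using NumPy broadcasting. This is a sketch that reproduces the small example; note the pairwise distance matrix has shape (len(arr2), len(arr1)), so at 10^6 rows on each side it would need far too much memory, which is why a smarter method is needed:

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]])

# Pairwise differences via broadcasting: shape (len(arr2), len(arr1), 3)
diff = arr2[:, None, :] - arr1[None, :, :]
# Squared Euclidean distance between every arr2 row and every arr1 row
sq_dist = (diff ** 2).sum(axis=-1)
# For each arr2 row, pick the arr1 row with the smallest distance
result = arr1[sq_dist.argmin(axis=1)]
print(result)  # rows drawn from arr1: [7 8 9], [1 2 3], [4 5 6]
```

Minimizing squared Euclidean distance picks the same row as minimizing MSE, since they differ only by a constant factor (the row length).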
Upvotes: 0
Views: 103
Reputation: 661
Using a nearest-neighbor method should work here, especially if you want to cut down on computation time. I'll give a simple example using scikit-learn's NearestNeighbors class, though there are probably even more efficient ways to do this.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Example data
X = np.random.randint(1000, size=(10000, 10))
Y = np.random.randint(1000, size=(10000, 10))

def map_to_nearest(source, query):
    # For each row of `query`, find the index of its nearest row in `source`
    neighbors = NearestNeighbors().fit(source)
    indices = neighbors.kneighbors(query, 1, return_distance=False)
    # Return the matching rows from `source`, as the question asks
    return source[indices.ravel()]

result = map_to_nearest(X, Y)
I'd note that this is calculating Euclidean distances, not MSE. This should be fine for finding the closest match: MSE is the squared Euclidean distance divided by the (constant) row length, so both criteria select the same nearest row.
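As a quick sanity check of that claim, here is a small sketch (with made-up data) showing that the row minimizing MSE is the same row that minimizes Euclidean distance:

```python
import numpy as np

a = np.array([3.0, 2.0, 1.0])
candidates = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

# MSE of each candidate row against `a`
mse = ((candidates - a) ** 2).mean(axis=1)
# Euclidean distance of each candidate row from `a`
euclid = np.linalg.norm(candidates - a, axis=1)

# Both criteria pick the same row, since mse = euclid**2 / row_length
assert mse.argmin() == euclid.argmin()
```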
Upvotes: 1