Replace each record with closest in numpy array/pandas dataframe

Question

So, the situation is:

I have two numpy 2D arrays/pandas dataframes (it doesn't matter which I use). Each of them contains approximately 10⁶ records. Each record is a row of 10 floats.

I need to replace each row in the second array (dataframe) with the row from the first one that has the smallest MSE relative to it. I can easily do it with "for" loops, but that sounds horrifyingly slow. Is there a nice numpy/pandas solution I'm not seeing?

P.S. For example:

arr1: [[1,2,3],[4,5,6],[7,8,9]]

arr2: [[9,10,11],[3,2,1],[5,5,5]]

result should be: [[7,8,9],[1,2,3],[4,5,6]]

In this example there are 3 numbers in each record and 3 records in total. I have 10 numbers in each record and around 1,000,000 records in total.

ivirshup · Accepted Answer

Using a nearest neighbor method should work here, especially if you want to cut down on computation time.

I'll give a simple example using scikit-learn's NearestNeighbors class, though there are probably even more efficient ways to do this.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Example data
X = np.random.randint(1000, size=(10000, 10))
Y = np.random.randint(1000, size=(10000, 10))

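def map_to_nearest(source, query):
    # Build the search structure on source; the default metric is Euclidean.
    neighbors = NearestNeighbors(n_neighbors=1).fit(source)
    # indices has shape (len(query), 1): the nearest source row for each query row.
    indices = neighbors.kneighbors(query, n_neighbors=1, return_distance=False)
    # Index into source (not query) so each query row is replaced by its match.
    return source[indices.ravel()]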

result = map_to_nearest(X, Y)

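I'd note that this is calculating Euclidean distances, not MSE. This should be fine for finding the closest match: MSE is the squared Euclidean distance divided by the number of columns, and both squaring and dividing by a positive constant preserve ordering, so the row with the smallest Euclidean distance is also the row with the smallest MSE.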
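As a quick sanity check, here is a minimal sketch that runs the three-row example from the question through a brute-force search and compares the MSE ranking with the squared Euclidean ranking. It uses plain NumPy broadcasting, so it builds an array of shape (len(query), len(source), n_columns) and is only practical for small inputs, not for 10⁶ rows. The helper name closest_index_by_mse is made up for this illustration.

```python
import numpy as np

# Example data from the question
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]], dtype=float)

def closest_index_by_mse(source, query):
    # Hypothetical helper: diff[i, j, :] = query[i] - source[j]
    diff = query[:, None, :] - source[None, :, :]
    # Mean squared error between every query row and every source row
    mse = (diff ** 2).mean(axis=2)
    # For each query row, the index of the source row with the smallest MSE
    return mse.argmin(axis=1)

idx_mse = closest_index_by_mse(arr1, arr2)

# Same search using squared Euclidean distance (sum instead of mean)
sq_euclid = ((arr2[:, None, :] - arr1[None, :, :]) ** 2).sum(axis=2)
idx_euclid = sq_euclid.argmin(axis=1)

result = arr1[idx_mse]
print(idx_mse, idx_euclid)
print(result)
```

On this data both rankings pick source rows 2, 0 and 1, which gives [[7,8,9],[1,2,3],[4,5,6]], the result the question asks for.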
