Kryo Bright

Reputation: 31

Replace each record with closest in numpy array/pandas dataframe

So, the situation is:

I have two 2D NumPy arrays (or pandas DataFrames; I can use either). Each of them contains approximately 10^6 records, and each record is a row of 10 floats.

I need to replace each row in the second array (DataFrame) with the row from the first one that has the smallest MSE relative to it. I can easily do this with "for" loops, but that sounds horrifyingly slow. Is there a clean NumPy/pandas solution I'm not seeing?

P.S. For example:

arr1: [[1,2,3],[4,5,6],[7,8,9]]

arr2: [[9,10,11],[3,2,1],[5,5,5]]

The result should be: [[7,8,9],[1,2,3],[4,5,6]]

In this example there are 3 numbers in each record and 3 records in total. My real data has 10 numbers in each record and around 1,000,000 records in total.

Upvotes: 0

Views: 103

Answers (1)

ivirshup

Reputation: 661

Using a nearest neighbor method should work here, especially if you want to cut down on computation time.

I'll give a simple example using scikit-learn's NearestNeighbors class, though there are probably more efficient ways to do this.

import numpy as np
from sklearn.neighbors import NearestNeighbors

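# Example data: X is the reference table, Y holds the rows to be replaced
X = np.random.randint(1000, size=(10000, 10))
Y = np.random.randint(1000, size=(10000, 10))

def map_to_nearest(source, query):
    """Return, for each row of query, the closest row of source."""
    # Build the neighbor index on the rows we want to pick from
    neighbors = NearestNeighbors().fit(source)
    # Index of the single nearest source row for every query row
    indices = neighbors.kneighbors(query, n_neighbors=1, return_distance=False)
    # Look the rows up in source (not query), so rows of Y become rows of X
    return source[indices.ravel()]

result = map_to_nearest(X, Y)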

I'd note that this computes Euclidean distances, not MSE. That is fine for finding the closest match: MSE is the squared Euclidean distance divided by the number of columns, so both measures rank the candidate rows in the same order.
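As a quick sanity check, here is a minimal brute-force sketch in plain NumPy on the small example from the question. It is only an illustration, not part of the scikit-learn approach: the variable names (diff, sq_dist, nearest_by_mse and so on) are made up for this sketch, and the pairwise array it builds grows with len(arr2) * len(arr1), so it is meant for small inputs only.

```python
import numpy as np

# The small example from the question
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
arr2 = np.array([[9, 10, 11], [3, 2, 1], [5, 5, 5]], dtype=float)

# Pairwise differences via broadcasting:
# shape is (len(arr2), len(arr1), number of columns)
diff = arr2[:, None, :] - arr1[None, :, :]

sq_dist = (diff ** 2).sum(axis=2)  # squared Euclidean distance
mse = sq_dist / arr1.shape[1]      # MSE = squared distance / number of columns
euclid = np.sqrt(sq_dist)          # plain Euclidean distance

# Index of the closest arr1 row for each arr2 row, under each measure
nearest_by_mse = mse.argmin(axis=1)
nearest_by_euclid = euclid.argmin(axis=1)

# Replace each row of arr2 with its closest row from arr1
result = arr1[nearest_by_mse]
print(result)
```

Both argmin calls pick the same rows here, and result matches the expected output given in the question. For the full 10^6 by 10^6 problem this pairwise array would not fit in memory, which is why a tree-based neighbor search is the more practical route.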

Upvotes: 1
