Matching nearest values in two dataframes of different lengths

Question

If I have two dataframes of different lengths, different labels and different levels of digit precision like so:

df1 = pd.DataFrame({'a':np.array([1.2345,2.2345,3.2345]),'b':np.array([4.123,5.123,6.123])})
df2 = pd.DataFrame({'A':np.array([1.2346,2.2343]),'B':np.array([4.1232,5.1239])})

How can I find the rows where the two dataframes have approximately matching values between columns 'a' and 'A' (say within 2 digits of precision) that results in a dataframe like so

      a         b         A         B
------------------------------------------------
 | 1.2345  |  4.123  |  1.2346  |  4.1232  |
 | 2.2345  |  5.123  |  2.2343  |  5.1239  |

Attempts:

Attempt #1:

matches_df = pd.merge(df1, df2,  how='inner', left_on=['a'], right_on = ['A'])

This only works if there are exact matches between columns 'a' and 'A' but I'm not sure how to incorporate a fudge factor to allow matching rows that are within 2 digits of precision.

Attempt #2

matches_df = df1.loc[np.round(df1['a'],2)==np.round(df2['A'],2)]

This gives the error "ValueError: Can only compare identically-labeled Series objects" because I think the two dataframes have different labels ('a','b' and 'A','B').

Any ideas on how this can be accomplished?

aminrd · Accepted Answer

Using KDTree, you can find the closest math in df1 in m O(log n) which n is the number of elements in df2 and m number of elements in df1.

import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

df1 = pd.DataFrame({'a':np.array([1.2345,2.2345,3.2345]),'b':np.array([4.123,5.123,6.123])})
df2 = pd.DataFrame({'A':np.array([1.2346,2.2343]),'B':np.array([4.1232,5.1239])})


def spatial_merge_NN(df1, df2, xyz=['A', 'B']):
    ''' Add features from df2 to df1, taking closest point '''
    tree = cKDTree(df2[xyz].values)
    dists, indices = tree.query(df1[['a','b']].values, k=1)
    fts = [c for c in df2.columns]
    for c in fts:
        df1[c] = df2[c].values[indices]
    return df1

df_new = spatial_merge_NN(df1, df2, ['A', 'B'])
#         a      b       A       B
# 0  1.2345  4.123  1.2346  4.1232
# 1  2.2345  5.123  2.2343  5.1239
# 2  3.2345  6.123  2.2343  5.1239

It put one dataframe constant ( in this case df1) and iterate through df2 and find the closest pair from d2 and add that row.

Matching nearest values in two dataframes of different lengths

Answers (1)

Related Questions