Dax Feliz
Dax Feliz

Reputation: 12929

Matching nearest values in two dataframes of different lengths

If I have two dataframes of different lengths, different labels and different levels of digit precision like so:

df1 = pd.DataFrame({'a':np.array([1.2345,2.2345,3.2345]),'b':np.array([4.123,5.123,6.123])})
df2 = pd.DataFrame({'A':np.array([1.2346,2.2343]),'B':np.array([4.1232,5.1239])})

How can I find the rows where the two dataframes have approximately matching values between columns 'a' and 'A' (say within 2 digits of precision) that results in a dataframe like so

      a         b         A         B
------------------------------------------------
 | 1.2345  |  4.123  |  1.2346  |  4.1232  |
 | 2.2345  |  5.123  |  2.2343  |  5.1239  |

Attempts:

Attempt #1:

matches_df = pd.merge(df1, df2,  how='inner', left_on=['a'], right_on = ['A'])

This only works if there are exact matches between columns 'a' and 'A' but I'm not sure how to incorporate a fudge factor to allow matching rows that are within 2 digits of precision.

Attempt #2

matches_df = df1.loc[np.round(df1['a'],2)==np.round(df2['A'],2)]

This gives the error "ValueError: Can only compare identically-labeled Series objects" because I think the two dataframes have different labels ('a','b' and 'A','B').

Any ideas on how this can be accomplished?

Upvotes: 1

Views: 699

Answers (1)

aminrd
aminrd

Reputation: 5050

Using KDTree, you can find the closest math in df1 in m O(log n) which n is the number of elements in df2 and m number of elements in df1.

import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

df1 = pd.DataFrame({'a':np.array([1.2345,2.2345,3.2345]),'b':np.array([4.123,5.123,6.123])})
df2 = pd.DataFrame({'A':np.array([1.2346,2.2343]),'B':np.array([4.1232,5.1239])})


def spatial_merge_NN(df1, df2, xyz=['A', 'B']):
    ''' Add features from df2 to df1, taking closest point '''
    tree = cKDTree(df2[xyz].values)
    dists, indices = tree.query(df1[['a','b']].values, k=1)
    fts = [c for c in df2.columns]
    for c in fts:
        df1[c] = df2[c].values[indices]
    return df1

df_new = spatial_merge_NN(df1, df2, ['A', 'B'])
#         a      b       A       B
# 0  1.2345  4.123  1.2346  4.1232
# 1  2.2345  5.123  2.2343  5.1239
# 2  3.2345  6.123  2.2343  5.1239

It put one dataframe constant ( in this case df1) and iterate through df2 and find the closest pair from d2 and add that row.

Upvotes: 2

Related Questions