Merge two dataframes on multiple keys with tolerance using merge_asof

Question

We wish to find the best matches out of multiple keys in two dataframes. From the documentation the method merge_asof seemed to be a reasonable choice. Testing it for one column it worked as expected.

import pandas as pd
import numpy as np
data_key_1 = np.arange(10).astype(float)
data_key_2 = np.arange(10).astype(float)
data_key_1_noisy = data_key_1-0.25*np.random.rand(10)
data_key_2_noisy = data_key_2-0.1*np.random.rand(10)
data_target = list('abcdefghij')

# one key approach
df_1 = pd.DataFrame(zip(data_key_1[::2], ), columns=['key_1',])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_target), columns=['key_1', 'target',])
df_result_1 = pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5)
print(df_result_1)

With console output as follow.

   key_1 target
0    0.0      a
1    2.0      c
2    4.0      e
3    6.0      g
4    8.0      i

When trying to use two keys, it failed. We tried different combinations of keyword settings but didn't manage to get it running.

# two keys approach
df_1 = pd.DataFrame(zip(data_key_1[::2], data_key_2[::2]), columns=['key_1', 'key2'])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_key_2_noisy, data_target), columns=['key_1', 'key2', 'target'])
df_result_2 = pd.merge_asof(df_1, df_2, on=['key_1', 'key_2'], direction='nearest', tolerance=0.5)
print(df_result_2)

It will throw an error:

pandas.errors.MergeError: can only asof on a key for left

Expected console output would have been.

   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i

So we questioned ourselves whether we try to apply this method in an inappropriate context, or if it is an applicable use-case and we just messed up with the keywords.

dermen · Accepted Answer

I believe you can also use KDTree to do this

from scipy.spatial import cKDTree
tree2 = cKDTree(df_2[["key_1", "key2"]])
# get distances and indices of nearest neighbors 
dists, inds = tree2.query(df_1[['key_1', 'key2']], 1)

# assign target if nearest neighbor is within tol
df_1['target'] = [df_2.loc[i, 'target'] if d < 0.5 else np.nan for d,i in zip(dists, inds)]

   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i

Previously I suggested query_ball_tree, but that method doesnt return nearest neighbors in any particular order, hence one should use query to directly grab the nearest neighbor.

Merge two dataframes on multiple keys with tolerance using merge_asof

Answers (2)

Related Questions