barium
barium

Reputation: 53

Merge two dataframes on multiple keys with tolerance using merge_asof

We wish to find the best matches out of multiple keys in two dataframes. From the documentation the method merge_asof seemed to be a reasonable choice. Testing it for one column it worked as expected.

import pandas as pd
import numpy as np
data_key_1 = np.arange(10).astype(float)
data_key_2 = np.arange(10).astype(float)
data_key_1_noisy = data_key_1-0.25*np.random.rand(10)
data_key_2_noisy = data_key_2-0.1*np.random.rand(10)
data_target = list('abcdefghij')

# one key approach
df_1 = pd.DataFrame(zip(data_key_1[::2], ), columns=['key_1',])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_target), columns=['key_1', 'target',])
df_result_1 = pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5)
print(df_result_1)

With console output as follow.

   key_1 target
0    0.0      a
1    2.0      c
2    4.0      e
3    6.0      g
4    8.0      i

When trying to use two keys, it failed. We tried different combinations of keyword settings but didn't manage to get it running.

# two keys approach
df_1 = pd.DataFrame(zip(data_key_1[::2], data_key_2[::2]), columns=['key_1', 'key2'])
df_2 = pd.DataFrame(zip(data_key_1_noisy, data_key_2_noisy, data_target), columns=['key_1', 'key2', 'target'])
df_result_2 = pd.merge_asof(df_1, df_2, on=['key_1', 'key_2'], direction='nearest', tolerance=0.5)
print(df_result_2)

It will throw an error:

pandas.errors.MergeError: can only asof on a key for left

Expected console output would have been.

   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i

So we questioned ourselves whether we try to apply this method in an inappropriate context, or if it is an applicable use-case and we just messed up with the keywords.

Upvotes: 1

Views: 2946

Answers (2)

dermen
dermen

Reputation: 5372

I believe you can also use KDTree to do this

from scipy.spatial import cKDTree
tree2 = cKDTree(df_2[["key_1", "key2"]])
# get distances and indices of nearest neighbors 
dists, inds = tree2.query(df_1[['key_1', 'key2']], 1)

# assign target if nearest neighbor is within tol
df_1['target'] = [df_2.loc[i, 'target'] if d < 0.5 else np.nan for d,i in zip(dists, inds)]

   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i

Previously I suggested query_ball_tree, but that method doesnt return nearest neighbors in any particular order, hence one should use query to directly grab the nearest neighbor.

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150785

merge_asof is always one key only (think why it requires key to be sorted). You can merge on one key and filter the other key.

(pd.merge_asof(df_1, df_2, on='key_1', direction='nearest', tolerance=0.5, suffixes=['','_'])
   # may need to drop duplicate on `key_1, key2` here based on `abs` as well
   .loc[lambda x: x['key2'].sub(x['key2_']).abs() < 0.5]
   .drop(columns=['key2_'])  
)

Output:

   key_1  key2 target
0    0.0   0.0      a
1    2.0   2.0      c
2    4.0   4.0      e
3    6.0   6.0      g
4    8.0   8.0      i

Upvotes: 3

Related Questions