Reputation: 55
Goal:
I want to merge two dataframes df1 and df2 with a tolerance in an efficient way using Python. df1 has shape (l, 2) and df2 has shape (p, 13), with l < m < p. My target dataframe df3, with shape (m, 13), is supposed to contain all matches within the tolerance, not only the closest match.
I want to merge Col0 of df1 with Col2 of df2 using a tolerance "tolerance".
Example:
df1:
Index, Col0, Col1
0, 1008.5155, n01
df2:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ...
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...
df3:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k13, n01, k15, k16, ...
1, 1008.5155, 0.1093, 1008.6248, k123, n01, k25, k26, ...
As shown, Col1 of df3 gives the absolute difference between the matched values of df1 and df2. Hence, it has to be smaller than the tolerance.
My current solution takes a lot of time and requires a lot of memory.
# Collect matches in an array (np.append copies the array on every call, which is slow)
df3_array = np.asarray([])
df3_row = np.zeros(13)
# Nested loops to find all matches within the tolerance
for n in range(len(df1)):
    for k in range(len(df2)):
        if abs(df1.iloc[n, 0] - df2.iloc[k, 2]) < tolerance:
            df3_row[0] = df1.iloc[n, 0]
            df3_row[1] = abs(df1.iloc[n, 0] - df2.iloc[k, 2])
            df3_row[2] = df2.iloc[k, 2]
            df3_row[3] = df2.iloc[k, 3]
            df3_row[4] = df1.iloc[n, 1]
            df3_row[5] = df2.iloc[k, 5]
            # ... (remaining columns filled the same way)
            df3_array = np.append(df3_array, df3_row)
# Reshape the flat array into rows of 13 and convert it into a dataframe
df3 = pd.DataFrame(df3_array.reshape(-1, 13), columns=header)
I have also tried to get both indices in one go with
[[n, k] for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]
However, it only gives me an empty list, so I am doing it wrong.
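For reference, I suspect the comprehension fails because zip walks both ranges in lockstep rather than forming all (n, k) combinations; itertools.product would give every pair. A minimal illustration with toy ranges:

```python
from itertools import product

# zip walks both ranges in lockstep, so it only yields matching positions
zipped = list(zip(range(2), range(3)))   # [(0, 0), (1, 1)]
# product yields every (n, k) combination of the two ranges
pairs = list(product(range(2), range(3)))
```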
For the respective arrays, I have also tried to use
np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]
However, np.isclose + np.nonzero only got me the indices of df2, and many more of them than with my loop-intensive approach. Without the corresponding indices of df1, I am kind of lost. I think this last approach is the most promising, yet I seem unable to merge the data sets because the values are not exact matches and because the closest match is not always the correct solution. Any ideas how to overcome this problem?
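For reference, on a 2-D boolean mask np.nonzero returns one index array per axis, so both sets of indices are recoverable in one go. A minimal sketch with the example values, using rtol=0 so that atol alone defines the tolerance:

```python
import numpy as np

tolerance = 1.0
# Stand-ins for df1['Col0'] and df2['Col2'] from the example
a = np.array([1008.5155, 510.0])
b = np.array([510.0103, 1007.6176, 1008.6248])

# mask[i, j] is True when |a[i] - b[j]| <= tolerance
mask = np.isclose(a[:, None], b[None, :], rtol=0, atol=tolerance)
idx1, idx2 = np.nonzero(mask)  # idx1 indexes df1, idx2 indexes df2
```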
Upvotes: 2
Views: 1881
Reputation: 7058
You need to divide this problem into parts. Using np.isclose, here is a very simple generator function which, for each row of df1, yields a DataFrame containing the indices of df1 and df2 that are close.
def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    # For each value in df1, collect the indices of all df2 rows
    # whose value lies within the tolerance
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        yield pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
Then we can concatenate these to get a helper DataFrame containing the matching indices.
df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)
To test this I added a 2nd record to df1
df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
   idx1  idx2
0     0     1
1     0     2
2     1     0
using pd.merge
# Expand df1 and df2 rows according to the index pairs, then join them side by side
df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)
      Col0_x Col1_x  Col0_y  Col1_y       Col2  Col3  Col4 Col5 Col6 ...
0  1008.5155    n01       0       0  1007.6176   k13     0  k15  k16 ...
1  1008.5155    n01       0       0  1008.6248  k123     0  k25  k26 ...
2      510.0    n03       0       0   510.0103   k03     0  k05  k06 ...
You'll need to rename a few columns and assign the difference between the matched values, but that should be trivial.
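A sketch of that last step, assuming the _x/_y suffixes pd.merge produces by default (Col0_x holds the df1 value, Col2 the matched df2 value; the exact target names are up to you):

```python
import pandas as pd

# Stand-in for df_merged with only the columns relevant to this step
df_merged = pd.DataFrame({
    'Col0_x': [1008.5155, 1008.5155],
    'Col1_x': ['n01', 'n01'],
    'Col2':   [1007.6176, 1008.6248],
})

# Assign the absolute difference as Col1 and map the suffixed names
# onto the target layout (Col0 = df1 value, Col4 = df1 label)
df3 = (df_merged
       .assign(Col1=(df_merged['Col0_x'] - df_merged['Col2']).abs())
       .rename(columns={'Col0_x': 'Col0', 'Col1_x': 'Col4'}))
```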
Upvotes: 1