Reputation: 2888
Suppose I have two tables
import pandas as pd
import numpy as np
first_table = pd.DataFrame({'key1': [1, 2, 2, 2, 3, 3],
'key2': ['a', 'a', 'a', 'b', 'a', 'b'],
'key3': ['A', 'A', 'B', 'A', 'A', 'A'],
'value_first': range(6)})
second_table = pd.DataFrame({'key1': [1, 1, 2, 2, 3],
'key2': [np.nan, np.nan, 'a', 'a', 'b'],
'key3': [np.nan, np.nan, 'A', 'B', np.nan],
'value_second': [6, 4, 2, 0, -2]})
so the first table looks like this
key1 key2 key3 value_first
0 1 a A 0
1 2 a A 1
2 2 a B 2
3 2 b A 3
4 3 a A 4
5 3 b A 5
while the second table looks like this
key1 key2 key3 value_second
0 1 NaN NaN 6
1 1 NaN NaN 4
2 2 a A 2
3 2 a B 0
4 3 b NaN -2
Now I want an outer merge of first_table
and second_table
based on the three keys. Note that the second table is not unique based on the three keys but the first are. Hence key2
and key3
is not necessary when key1
is unique in the first table. In the same way key3
is not necessary when the first two keys are unique in combination.
If the second table were proberly filled in then the merge would be straightforward by
pd.merge(first_table, second_table,
how='outer',
left_on=['key1', 'key2', 'key3'],
right_on=['key1', 'key2', 'key3'])
but how do I get the desired result in this case? The desired result looks like this
key1 key2 key3 value_first value_second
0 1 a A 0.0 6.0
1 1 a A 0.0 4.0
2 2 a A 1.0 2.0
3 2 a B 2.0 0.0
4 2 b A 3.0 NaN
5 3 a A 4.0 NaN
6 3 b A 5.0 -2.0
Upvotes: 1
Views: 95
Reputation: 17339
The idea is to first merge the dataframes just on key1
, then fill the NaN
with respective values from first table, then drop rows when values are different, and finally merge with the first table again to get the remaining rows (where value_second=NaN
).
df = pd.merge(first_table, second_table, left_on=['key1'], right_on=['key1'], how='outer')
df['key2_y'] = df['key2_y'].fillna(df['key2_x'])
df['key3_y'] = df['key3_y'].fillna(df['key3_x'])
df = df[(df['key2_x'] == df['key2_y']) & (df['key3_x'] == df['key3_y'])]
df.drop(['key2_y', 'key3_y'], axis=1, inplace=True)
df = pd.merge(first_table, df, left_on=['key1', 'key2', 'key3', 'value_first'],
right_on=['key1', 'key2_x', 'key3_x', 'value_first'], how='outer')
df.drop(['key2_x', 'key3_x'], axis=1, inplace=True)
key1 key2 key3 value_first value_second
0 1 a A 0 6.0
1 1 a A 0 4.0
2 2 a A 1 2.0
3 2 a B 2 0.0
4 2 b A 3 NaN
5 3 a A 4 NaN
6 3 b A 5 -2.0
Upvotes: 1