mr.bjerre
mr.bjerre

Reputation: 2888

Merge two tables in python (using pandas) where the link depends on uniqueness

Suppose I have two tables

import pandas as pd
import numpy as np

first_table = pd.DataFrame({'key1': [1, 2, 2, 2, 3, 3],
                            'key2': ['a', 'a', 'a', 'b', 'a', 'b'],
                            'key3': ['A', 'A', 'B', 'A', 'A', 'A'],
                            'value_first': range(6)})
second_table = pd.DataFrame({'key1': [1, 1, 2, 2, 3],
                            'key2': [np.nan, np.nan, 'a', 'a', 'b'],
                            'key3': [np.nan, np.nan, 'A', 'B', np.nan],
                            'value_second': [6, 4, 2, 0, -2]})

so the first table looks like this

   key1 key2 key3  value_first
0     1    a    A            0
1     2    a    A            1
2     2    a    B            2
3     2    b    A            3
4     3    a    A            4
5     3    b    A            5

while the second table looks like this

   key1 key2 key3  value_second
0     1  NaN  NaN             6
1     1  NaN  NaN             4
2     2    a    A             2
3     2    a    B             0
4     3    b  NaN            -2

Now I want an outer merge of first_table and second_table based on the three keys. Note that the second table is not unique based on the three keys but the first are. Hence key2 and key3 is not necessary when key1 is unique in the first table. In the same way key3 is not necessary when the first two keys are unique in combination.

If the second table were proberly filled in then the merge would be straightforward by

pd.merge(first_table, second_table, 
         how='outer', 
         left_on=['key1', 'key2', 'key3'], 
         right_on=['key1', 'key2', 'key3'])

but how do I get the desired result in this case? The desired result looks like this

   key1 key2 key3  value_first  value_second
0   1    a    A          0.0           6.0
1   1    a    A          0.0           4.0
2   2    a    A          1.0           2.0
3   2    a    B          2.0           0.0
4   2    b    A          3.0           NaN
5   3    a    A          4.0           NaN
6   3    b    A          5.0          -2.0

Upvotes: 1

Views: 95

Answers (1)

Dennis Golomazov
Dennis Golomazov

Reputation: 17339

The idea is to first merge the dataframes just on key1, then fill the NaN with respective values from first table, then drop rows when values are different, and finally merge with the first table again to get the remaining rows (where value_second=NaN).

df = pd.merge(first_table, second_table, left_on=['key1'], right_on=['key1'], how='outer')
df['key2_y'] = df['key2_y'].fillna(df['key2_x'])
df['key3_y'] = df['key3_y'].fillna(df['key3_x'])
df = df[(df['key2_x'] == df['key2_y']) & (df['key3_x'] == df['key3_y'])]
df.drop(['key2_y', 'key3_y'], axis=1, inplace=True)
df = pd.merge(first_table, df, left_on=['key1', 'key2', 'key3', 'value_first'], 
              right_on=['key1', 'key2_x', 'key3_x', 'value_first'], how='outer')
df.drop(['key2_x', 'key3_x'], axis=1, inplace=True)

       key1 key2 key3  value_first  value_second
0     1    a    A            0           6.0
1     1    a    A            0           4.0
2     2    a    A            1           2.0
3     2    a    B            2           0.0
4     2    b    A            3           NaN
5     3    a    A            4           NaN
6     3    b    A            5          -2.0

Upvotes: 1

Related Questions