Vineet
Vineet

Reputation: 33

Join two pandas dataframes (union) on multiple columns without repetition

I have two dataframes

df1:

A B C val0
A0 B0 C0 V0
A0 B0 C0 V1
A1 B1 C1 V2
A2 B2 C2 V3

df2:

A B C val1
A0 B0 C0 V4
A1 B1 C1 V5
A1 B1 C1 V6
A3 B3 C3 V7

I want to combine them wrt A,B and C. So that, I can visualize val0 and val1 together.

Output:

A B C val0 val1
A0 B0 C0 V0 V4
A0 B0 C0 V1
A1 B1 C1 V2 V5
A1 B1 C1 V6
A2 B2 C2 V3
A3 B3 C3 V7

I tried:

pd.merge(df1,df2, how='outer', on = ['A','B','C'])

But since the (A,B,C) tuples aren't unique, it gives me a lot of repetitions. Also, (A,B,C) tuples can be non-overlapping between the two dataframes.

I just want to see val0 and val1 together in a compact manner for given tuple, so the order doesn't matter - eg. in the given output, it is okay if val1 = V4 in either index=0 or index=1 (but should not appear in both)

Upvotes: 0

Views: 194

Answers (3)

Bill
Bill

Reputation: 11633

An alternative way to

see val0 and val1 together in a compact manner for given tuple

would be to combine them as lists:

to_list = lambda x: [i for i in x if i is not np.nan]
index_labels = ['A', 'B', 'C']
pd.concat([df1, df2]).groupby(index_labels).agg({'val0': to_list, 'val1': to_list})

Output:

              val0      val1
A  B  C                     
A0 B0 C0  [V0, V1]      [V4]
A1 B1 C1      [V2]  [V5, V6]
A2 B2 C2      [V3]        []
A3 B3 C3        []      [V7]

Upvotes: 0

mozway
mozway

Reputation: 261964

You can set a helper column (key) for the merge to render your duplicated rows unique, then merge:

cols = ['A', 'B', 'C']
(df1.assign(key=df1.groupby(cols).cumcount())
    .merge(df2.assign(key=df2.groupby(cols).cumcount()),
           on=cols+['key'],
           how='outer'
          )
    .sort_values(by=cols)
    .drop('key', axis=1)
)

output:

     A    B    C val0 val1
0  A0   B0   C0    V0   V4
1  A0   B0   C0    V1  NaN
2  A1   B1   C1    V2   V5
4  A1   B1   C1   NaN   V6
3  A2   B2   C2    V3  NaN
5  A3   B3   C3   NaN   V7

Upvotes: 2

Bill
Bill

Reputation: 11633

This is close to what you want I think.

index_labels = ['A', 'B', 'C']
pd.concat([df1.set_index(index_labels), 
           df2.set_index(index_labels)]).sort_index()

Output:

         val0 val1
A  B  C           
A0 B0 C0   V0  NaN
      C0   V1  NaN
      C0  NaN   V4
A1 B1 C1   V2  NaN
      C1  NaN   V5
      C1  NaN   V6
A2 B2 C2   V3  NaN
A3 B3 C3  NaN   V7

Or, alternatively:

pd.concat([df1, df2]).sort_values(by=index_labels)

Output:

    A   B   C val0 val1
0  A0  B0  C0   V0  NaN
1  A0  B0  C0   V1  NaN
0  A0  B0  C0  NaN   V4
2  A1  B1  C1   V2  NaN
1  A1  B1  C1  NaN   V5
2  A1  B1  C1  NaN   V6
3  A2  B2  C2   V3  NaN
3  A3  B3  C3  NaN   V7

Upvotes: 0

Related Questions