Reputation: 33
I have two dataframes
df1:
A | B | C | val0 |
---|---|---|---|
A0 | B0 | C0 | V0 |
A0 | B0 | C0 | V1 |
A1 | B1 | C1 | V2 |
A2 | B2 | C2 | V3 |
df2:
A | B | C | val1 |
---|---|---|---|
A0 | B0 | C0 | V4 |
A1 | B1 | C1 | V5 |
A1 | B1 | C1 | V6 |
A3 | B3 | C3 | V7 |
I want to combine them wrt A,B and C. So that, I can visualize val0 and val1 together.
Output:
A | B | C | val0 | val1 |
---|---|---|---|---|
A0 | B0 | C0 | V0 | V4 |
A0 | B0 | C0 | V1 | |
A1 | B1 | C1 | V2 | V5 |
A1 | B1 | C1 | V6 | |
A2 | B2 | C2 | V3 | |
A3 | B3 | C3 | V7 |
I tried:
pd.merge(df1,df2, how='outer', on = ['A','B','C'])
But since the (A,B,C) tuples aren't unique, it gives me a lot of repetitions. Also, (A,B,C) tuples can be non-overlapping between the two dataframes.
I just want to see val0 and val1 together in a compact manner for given tuple, so the order doesn't matter - eg. in the given output, it is okay if val1 = V4 in either index=0 or index=1 (but should not appear in both)
Upvotes: 0
Views: 194
Reputation: 11633
An alternative way to
see val0 and val1 together in a compact manner for given tuple
would be to combine them as lists:
to_list = lambda x: [i for i in x if i is not np.nan]
index_labels = ['A', 'B', 'C']
pd.concat([df1, df2]).groupby(index_labels).agg({'val0': to_list, 'val1': to_list})
Output:
val0 val1
A B C
A0 B0 C0 [V0, V1] [V4]
A1 B1 C1 [V2] [V5, V6]
A2 B2 C2 [V3] []
A3 B3 C3 [] [V7]
Upvotes: 0
Reputation: 261964
You can set a helper column (key
) for the merge to render your duplicated rows unique, then merge
:
cols = ['A', 'B', 'C']
(df1.assign(key=df1.groupby(cols).cumcount())
.merge(df2.assign(key=df2.groupby(cols).cumcount()),
on=cols+['key'],
how='outer'
)
.sort_values(by=cols)
.drop('key', axis=1)
)
output:
A B C val0 val1
0 A0 B0 C0 V0 V4
1 A0 B0 C0 V1 NaN
2 A1 B1 C1 V2 V5
4 A1 B1 C1 NaN V6
3 A2 B2 C2 V3 NaN
5 A3 B3 C3 NaN V7
Upvotes: 2
Reputation: 11633
This is close to what you want I think.
index_labels = ['A', 'B', 'C']
pd.concat([df1.set_index(index_labels),
df2.set_index(index_labels)]).sort_index()
Output:
val0 val1
A B C
A0 B0 C0 V0 NaN
C0 V1 NaN
C0 NaN V4
A1 B1 C1 V2 NaN
C1 NaN V5
C1 NaN V6
A2 B2 C2 V3 NaN
A3 B3 C3 NaN V7
Or, alternatively:
pd.concat([df1, df2]).sort_values(by=index_labels)
Output:
A B C val0 val1
0 A0 B0 C0 V0 NaN
1 A0 B0 C0 V1 NaN
0 A0 B0 C0 NaN V4
2 A1 B1 C1 V2 NaN
1 A1 B1 C1 NaN V5
2 A1 B1 C1 NaN V6
3 A2 B2 C2 V3 NaN
3 A3 B3 C3 NaN V7
Upvotes: 0