Reputation: 161
I have a list of pandas dataframes as below and would like to concat/merge them so that the values in the shared column are exhaustive between all of the dataframes. What is the best approach?
DF 1:
Col1 | Col2 |
---|---|
BLAH1 | A |
BLAH2 | Z |
DF 2:
Col1 | Col2 | Col3 |
---|---|---|
BLAH2 | Z | B |
BLAH3 | Q | K |
DF 3:
Col1 | Col4 |
---|---|
BLAH2 | C |
BLAH3 | W |
Desired Outcome
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
BLAH1 | A | NaN | NaN |
BLAH2 | Z | B | C |
BLAH3 | Q | K | W |
Upvotes: 3
Views: 816
Reputation: 3455
We can use reduce
and merge
like so :
from functools import reduce
reduce(lambda left, right: pd.merge(left, right, on=list(left.columns.intersection(right.columns)), how='outer'), [df1, df2, df3])
Here the reduce
apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left
, is the accumulated value and the right argument, right
, is the update value from the iterable.
The trick here is to merge
on the list of common columns from the DataFrames to get the expected result. @QuangHoang did something similar
and find the trick before me.
Output :
Col1 Col2 Col3 Col4
0 BLAH1 A NaN NaN
1 BLAH2 Z B C
2 BLAH3 Q K W
Upvotes: 2
Reputation: 153510
Let's use functools.reduce
with pd.DataFrame.combine_first
and comprehension:
from functools import reduce
reduce(lambda x, y: x.combine_first(y),
(df.set_index('Col1') for df in [df1,df2,df3])).reset_index()
Output:
Col1 Col2 Col3 Col4
0 BLAH1 A NaN NaN
1 BLAH2 Z B C
2 BLAH3 Q K W
Given input dataframes as:
df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
'Col2':[*'ZQ'],
'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
'Col4':[*'CW']})
Upvotes: 2
Reputation: 150785
If the keys are unique, within each dataframe, you can do a concat then groupby:
list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()
In general, you can combine a reduce
and merge
:
from functools import reduce
# find common columns
commons = reduce(lambda x,y: set(x).intersection(set(y)), list_dfs)
reduce(lambda x,y: x.merge(y, on=commons), list_dfs)
Upvotes: 4