magic_frank
magic_frank

Reputation: 161

Best way to concat/merge list of pandas dataframes into one while preserving all values in a given column

I have a list of pandas dataframes as below and would like to concat/merge them so that the values in the shared column are exhaustive between all of the dataframes. What is the best approach?

DF 1:

Col1 Col2
BLAH1 A
BLAH2 Z

DF 2:

Col1 Col2 Col3
BLAH2 Z B
BLAH3 Q K

DF 3:

Col1 Col4
BLAH2 C
BLAH3 W

Desired Outcome

Col1 Col2 Col3 Col4
BLAH1 A NaN NaN
BLAH2 Z B C
BLAH3 Q K W

Upvotes: 3

Views: 816

Answers (3)

tlentali
tlentali

Reputation: 3455

We can use reduce and merge like so :

from functools import reduce

reduce(lambda left, right: pd.merge(left, right, on=list(left.columns.intersection(right.columns)), how='outer'), [df1, df2, df3])

Here the reduce apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left, is the accumulated value and the right argument, right, is the update value from the iterable.

The trick here is to merge on the list of common columns from the DataFrames to get the expected result. @QuangHoang did something similar and find the trick before me.

Output :

    Col1    Col2    Col3    Col4
0   BLAH1   A       NaN     NaN
1   BLAH2   Z       B       C
2   BLAH3   Q       K       W

Upvotes: 2

Scott Boston
Scott Boston

Reputation: 153510

Let's use functools.reduce with pd.DataFrame.combine_first and comprehension:

from functools import reduce
reduce(lambda x, y: x.combine_first(y), 
       (df.set_index('Col1') for df in [df1,df2,df3])).reset_index()

Output:

    Col1 Col2 Col3 Col4
0  BLAH1    A  NaN  NaN
1  BLAH2    Z    B    C
2  BLAH3    Q    K    W

Given input dataframes as:

df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
                   'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                   'Col2':[*'ZQ'],
                   'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                    'Col4':[*'CW']})

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

If the keys are unique, within each dataframe, you can do a concat then groupby:

list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()

In general, you can combine a reduce and merge:

from functools import reduce

# find common columns
commons = reduce(lambda x,y: set(x).intersection(set(y)), list_dfs)

reduce(lambda x,y: x.merge(y, on=commons), list_dfs)

Upvotes: 4

Related Questions