data_person
data_person

Reputation: 4490

Pandas merge list of DFs based on grouping column value

I have a list of Pandas DFs with every DF having the same columns:

df1_values = [["2001-01-01","Lime",10],["2001-01-02","Lime",20]]
df2_values = [["2001-01-01","Mango",40],["2001-01-02","Mango",50],["2001-01-03","Mango",60]]
df3_values = [["2001-01-01","Orange",30]]
df1 = pd.DataFrame(df1_values,columns=["date","fruit","value"])
df2 = pd.DataFrame(df2_values,columns=["date","fruit","value"])
df3 = pd.DataFrame(df3_values,columns=["date","fruit","value"])
dfs = [df1,df2,df3]

one of the sample DFs --> DF1:

      date     fruit    value
0   2001-01-01  Lime    10
1   2001-01-02  Lime    20

Trying to merge all the DFs in the list in the format below (grouped on date), EXPECTED OP:

    date         fruit  value
  2001-01-01     Lime    10
  2001-01-01     Mango   40
  2001-01-01     Orange  30
  2001-01-02     Lime    20
  2001-01-02     Mango   50
  2001-01-03     Mango   60

Current iterative approach:

date_dict={}
for each_date in ["2001-01-01","2001-01-02","2001-01-03"]:
   for each_df in dfs:
       if each_date in date_dict:
        #append the values for this date
       else:
           #enter the values for this date

It is working but it is taking a long time.

Pandas Approach:

from functools import reduce
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['fruit'],
                                        how='outer'), dfs)

OP:

     date_x    fruit    value_x date_y     value_y  date    value
0   2001-01-01  Lime    10.0    NaN         NaN     NaN     NaN
1   2001-01-02  Lime    20.0    NaN         NaN     NaN     NaN
2   NaN         Mango   NaN    2001-01-01   40.0    NaN     NaN
3   NaN         Mango   NaN    2001-01-02   50.0    NaN     NaN
4   NaN         Mango   NaN    2001-01-03   60.0    NaN     NaN
5   NaN         Orange  NaN     NaN         NaN  2001-01-01 30.0

Any suggestions on how to correct the mistakes could be helpful.

Upvotes: 2

Views: 166

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195528

You can do pandas.concat followed by .sort_values:

print( pd.concat(dfs).sort_values('date') )

Prints:

         date   fruit  value
0  2001-01-01    Lime     10
0  2001-01-01   Mango     40
0  2001-01-01  Orange     30
1  2001-01-02    Lime     20
1  2001-01-02   Mango     50
2  2001-01-03   Mango     60

Upvotes: 3

Related Questions