R overflow
R overflow

Reputation: 1352

Merge two Dataframes with different structures

I am trying to combine two dataframes with each other. But I am stuck on the logic (maybe I looked too long to the problem).

Problem: DF1 is in a format like this:

Country       ID    March    April  May  June 
Netherlands   A      10       20     15   14 
Netherlands   B      15       18     10    4
Germany       C       9        8      3    1 

Then, I have another dataframe:

DF2

Country      ID    Month   Number
Netherlands  A     March     8  
Netherlands  A     March     28  
Netherlands  B     April     18  
Germany      C     Jan       9

I would like to merge these two columns (in the same format as DF1). My assumption was that is should be something like, group by (Country + ID), and then (Month as column, and SUM the Number). - but as said before, I am confussed.

Desired output:

Country       ID    March    April  May  June   df2_Jan  df2_March   df2_April
Netherlands   A      10       20     15   14       0       36 #8+28     0
Netherlands   B      15       18     10    4       0        0          18
Germany       C       9        8      3    1       9        0           0

Hope that you can help me. Thanks a lot in advance!

Upvotes: 1

Views: 393

Answers (2)

Cimbali
Cimbali

Reputation: 11405

  • Summing in DF2 indeed is a simple groupby().sum() operation

    df2 = df2.groupby(['Country', 'ID', 'Month'])['Number'].sum()
    

    You now have a series with 3-level index.

  • It seems to make DF2 to the same format as DF1 you only need a single .unstack() which pivots one of the index levels to columns.

    df2 = df2.unstack('Month', fill_value=0)
    
  • Merging in same format is now easy

    df1.merge(df2.add_prefix('df2_').reset_index(), on=['Country', 'ID'])
    

Steps 1 and 2 can be replace with a single call to .pivot_table with the aggfunc parameter, as per @Nk03’s answer.

Upvotes: 1

Nk03
Nk03

Reputation: 14949

You can try:

merged_df = (
    df1.merge(
        df2.pivot_table(
            index=['Country', 'ID'],
            columns='Month',
            values='Number',
            aggfunc=sum,
            fill_value=0)
        .add_prefix('df2_')
        .reset_index(),
        on=['Country', 'ID']
    )
)

OUTPUT:

       Country ID  March  April  May  June  df2_April  df2_Jan  df2_March
0  Netherlands  A     10     20   15    14          0        0         36
1  Netherlands  B     15     18   10     4         18        0          0
2      Germany  C      9      8    3     1          0        9          0

Upvotes: 1

Related Questions