Wilian
Wilian

Reputation: 1257

Merge AND sum or concatenate (with TWO dataframes)

(This post does not explain what was requested below. There is the issue of merge before concatenation or sum, using these functions together to fill a column is the question.)

I have two dataframes that look like these (obviously these are examples):

data1 = {'Name': ['Alex','Alex','Cristiano','Cristiano','Fernando','Jonas','William'], 'Color': ['Blue','Red','Black','Blue','Yellow','Pink','Green'], 
    'Codes': ['','','','','','',''], 'Values': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
data2 = {'Name': ['Jonas','Alex','Cristiano','Cristiano','Alex'], 'Color': ['Pink','Red','Black','Blue','Red'], 
    'Codes': ['1456','1450','1453','1530','1459'], 'Values': [12000.00,5000.50,78000.00,2000.00,1500.00]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Are so:

df1:
        Name   Color Codes  Values
0       Alex    Blue           NaN
1       Alex     Red           NaN
2  Cristiano   Black           NaN
3  Cristiano    Blue           NaN
4   Fernando  Yellow           NaN
5      Jonas    Pink           NaN
6    William   Green           NaN
df2:
        Name  Color Codes   Values
0      Jonas   Pink  1456  12000.0
1       Alex    Red  1450   5000.5
2  Cristiano  Black  1453  78000.0
3  Cristiano   Blue  1530   2000.0
4       Alex    Red  1459   1500.0

I want to fill the 'Codes' and 'Values' columns of df1 with the corresponding values ​​in df2 ('Code' and 'Values'), doing the matching of the columns 'Name' and 'Color'.

The problem is that there is more than one correspondence, for the column 'Codes' I want the concatenation of strings in 'Code' (separating with a comma), for the column 'Values' I want the sum of the correspondences.

I tried using merge, but this second part I don't know how to implement. To be clearer, the desired output of df1 is this:

        Name   Color        Codes   Values
0       Alex    Blue                   NaN
1       Alex     Red  1450 , 1459   6500.5
2  Cristiano   Black         1453  78000.0
3  Cristiano    Blue         1530   2000.0
4   Fernando  Yellow                   NaN
5      Jonas    Pink         1456  12000.0
6    William   Green                   NaN

Upvotes: 1

Views: 71

Answers (1)

sophocles
sophocles

Reputation: 13821

You can use pd.merge() using df1 as it is and a grouped version of df2 (grouped by Name and Color). The aggregation should be list on Codes and sum on Values:

pd.merge(df1,
         df2.groupby(['Name','Color']).agg({'Codes':list,'Values':'sum'}).reset_index(),
         how='left',
         on=['Name','Color'],
         suffixes=('_x', ''))\
    .drop(['Codes_x','Values_x'],axis=1)

Which prints back:

        Name   Color         Codes   Values
0       Alex    Blue           NaN      NaN
1       Alex     Red  [1450, 1459]   6500.5
2  Cristiano   Black        [1453]  78000.0
3  Cristiano    Blue        [1530]   2000.0
4   Fernando  Yellow           NaN      NaN
5      Jonas    Pink        [1456]  12000.0
6    William   Green           NaN      NaN

Upvotes: 1

Related Questions