ChngFong Chia
ChngFong Chia

Reputation: 147

Combining 2 dataframes with different number of columns and aggregate them with sum

So I have these 2 dataframes

df1
title      URL  number         date
    a   /url-1       1   21-02-2020
    a   /url-1      10   20-02-2020
    a   /url-1      17   19-02-2020
    b   /url-2     100   21-02-2020
    b   /url-2     106   20-02-2020
df2
   URL  number         date
/url-1       5   21-02-2020
/url-1      12   20-02-2020
/url-1      50   19-02-2020
/url-3       9   21-02-2020
/url-3      11   20-02-2020

So I need to perform these actions to combine them into 1 dataframe:

1) Outer join (correct me if I'm wrong) df1 and df2 but I don't know how to show the results in 4 columns as I need to group by "URL" and "date" columns (will be illustrated below).

2) Aggregate df1["number"] and df2["number"] and sum them together

This is the result I want:

new_df
title      URL  number         date
    a   /url-1       6   21-02-2020
    a   /url-1      22   20-02-2020
    a   /url-1      67   19-02-2020
    b   /url-2     100   21-02-2020
    b   /url-2     106   20-02-2020
 null   /url-3       9   21-02-2020
 null   /url-3      11   20-02-2020

Your help is much appreciated! :)

Upvotes: 1

Views: 40

Answers (1)

sammywemmy
sammywemmy

Reputation: 28709

use pandas merge, set the how option as outer, and merge both dataframes on URL and date.
Sum the two numbers and keep only the columns that are relevant.

M = df1.merge(df2,how='outer', on=['URL','date'])
M['number'] = M.number_x.fillna(0).add(M.number_y.fillna(0))
M.filter(['title','URL','number','date'])

    title   URL number  date
0   a   /url-1  6.0 21-02-2020
1   a   /url-1  22.0    20-02-2020
2   a   /url-1  67.0    19-02-2020
3   b   /url-2  100.0   21-02-2020
4   b   /url-2  106.0   20-02-2020
5   NaN /url-3  9.0 21-02-2020
6   NaN /url-3  11.0    20-02-2020

Upvotes: 1

Related Questions