Jato

Reputation: 61

Pandas: Merge on 2 columns

I'm working with a large dataset and have the following issue: Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a year.

Example:

import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
       60,10,10,40,np.nan,np.nan,
       np.nan,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.nan,
       1991,1992,np.nan,1995,1995,np.nan,
       2001,2002,2004,2004,2005,np.nan]

dictex3 = {"id": ids, "year": year, "sub-input": in1, "y-arrival": arr}
dfex3 = pd.DataFrame(dictex3)

I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:

dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform("sum")
print(dfex3)
    id  year  sub-input  y-arrival  input_sum_tf
0    1  2000       20.0     2002.0          20.0
1    1  2001       40.0     2004.0          80.0
2    1  2002       10.0     2004.0          80.0
3    1  2003       30.0     2004.0          80.0
4    1  2004       50.0     2005.0          50.0
5    1  2005       80.0        NaN           NaN
6    2  1990       60.0     1991.0          60.0
7    2  1991       10.0     1992.0          10.0
8    2  1992       10.0        NaN           NaN
9    2  1993       40.0     1995.0          40.0
10   2  1994        NaN     1995.0          40.0
11   2  1995        NaN        NaN           NaN
12   3  2000        NaN     2001.0           0.0
13   3  2001      120.0     2002.0         120.0
14   3  2002       30.0     2004.0         100.0
15   3  2003       70.0     2004.0         100.0
16   3  2004       60.0     2005.0          60.0
17   3  2005       90.0        NaN           NaN

Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.

The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:

The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).

The result ("input_sum") should look like this:

0       NaN
1       NaN
2      20.0
3       NaN
4      80.0
5      50.0
6       NaN
7      60.0
8      10.0
9       NaN
10      NaN
11     40.0
12      NaN
13      NaN
14    120.0
15      NaN
16    100.0
17     60.0

My approach:

dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
                                             right_on=['id','year'], 
                                             how='right')['input_sum_tf_x']
dfex3["input_sum"]
0       NaN
1       NaN
2      20.0
3       NaN
4      80.0
5      80.0
6      80.0
7      50.0
8       NaN
9      60.0
10     10.0
11      NaN
12      NaN
13     40.0
14     40.0
15      NaN
16      0.0
17    120.0

Any help would be much appreciated!

Upvotes: 1

Views: 177

Answers (1)

LRRR

Reputation: 456

The issue is that your code merges 'year' against 'y-arrival', so it makes multiple matches where you only want one. E.g. row 4, where year=2004, matches all three rows where y-arrival=2004 (rows 1-3), hence the duplicated 80s in your output rows 4-6.
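This kind of accidental fan-out can be caught early with merge's `validate` argument. A minimal sketch with a made-up three-row frame (not your full data):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: two rows share y-arrival == 2004.0
df = pd.DataFrame({"id": [1, 1, 1],
                   "year": [2001, 2002, 2003],
                   "y-arrival": [2004.0, 2004.0, np.nan]})

# validate="one_to_many" requires the left keys to be unique; since
# (1, 2004.0) appears twice on the left, pandas raises a MergeError
# instead of silently duplicating rows.
try:
    df.merge(df, left_on=["id", "y-arrival"], right_on=["id", "year"],
             how="right", validate="one_to_many")
except pd.errors.MergeError as e:
    print("caught:", e)
```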

Use groupby to take the last row for each id/y-arrival combination (it also looks like you don't want matches where 'input_sum_tf' is zero):

df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]

Then merge:

dfex3.merge(df_last, 
            left_on=['id', 'year'], 
            right_on=['id', 'y-arrival'],
            how='left')['input_sum_tf_y']

0       NaN
1       NaN
2      20.0
3       NaN
4      80.0
5      50.0
6       NaN
7      60.0
8      10.0
9       NaN
10      NaN
11     40.0
12      NaN
13      NaN
14    120.0
15      NaN
16    100.0
17     60.0
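For what it's worth, the same lookup can be done without building a second DataFrame, by reindexing a grouped sum. A sketch under the same assumptions (a zero sum means "nothing arrives" and is turned back into NaN, mirroring the filter above):

```python
import pandas as pd
import numpy as np

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80, 60,10,10,40,np.nan,np.nan, np.nan,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.nan, 1991,1992,np.nan,1995,1995,np.nan, 2001,2002,2004,2004,2005,np.nan]
dfex3 = pd.DataFrame({"id": ids, "year": year, "sub-input": in1, "y-arrival": arr})

# Sum of sub-input per (id, y-arrival); an all-NaN group sums to 0,
# which we treat as "no arrival" and turn back into NaN.
sums = (dfex3.groupby(["id", "y-arrival"])["sub-input"].sum()
             .replace(0, np.nan))

# Look up each (id, year) pair in that table; cast year to float so the
# key dtype matches the float "y-arrival" index level.
keys = pd.MultiIndex.from_arrays([dfex3["id"], dfex3["year"].astype(float)])
dfex3["input_sum"] = sums.reindex(keys).to_numpy()
```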

Upvotes: 2
