Reputation: 61
I'm working with a large dataset and have the following issue: Let's say i'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input i have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-input's arrive in the same year and sometimes no substance arrives in a year.
Example:
import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year= [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
60,10,10,40,np.NaN,np.NaN,
np.NaN,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.NaN,
1991,1992,np.NaN,1995,1995,np.NaN,
2001,2002,2004,2004,2005,np.NaN]
dictex3 ={"id":ids,"year":year,"sub-input":in1, "y-arrival":arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform(sum)
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
right_on=['id','year'],
how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
Upvotes: 1
Views: 177
Reputation: 456
The issue is your code is trying to merge on 'year' and 'y-arrival', so its making multiple matches when you only want one match. E.g. Row 4 where year=2004 will match 3 times where y-arrival=2004 (rows 1-3), hence the duplicates of 80 in the output rows 4-6.
Use groupby to get the last row for each id/y-arrival combo (also looks like you don't want matches where 'input_sum_tf' is zero):
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
left_on=['id', 'year'],
right_on=['id', 'y-arrival'],
how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
Upvotes: 2