Reputation: 61
I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example i got the measurements of the pollution of 3 rivers for different timespans. Each year, the amount pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated, in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal ist to create a new column ["result_of_upstream_pollution"], which contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution"-column has to be reassigned.
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.NaN,1991,1992,1993,1994,np.NaN,np.NaN,2012,2012,2013,2014,2015,np.NaN]
poll = [10,14,20,11,8,11,
20,22,20,25,18,21,
30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
Upvotes: 1
Views: 1955
Reputation: 61
I just realized that this solution doesn't seem to be working for me. When i execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to be handling the "NaN" values in the right way.
@Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how i can get this code to work?
Upvotes: 0
Reputation: 150805
As you said in the title, this is merge on two column:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(df)
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
Upvotes: 1