Python: Merge on 2 columns

Question

I'm working with a large dataset. The following example uses a smaller one.

In this example I have pollution measurements for 3 rivers over different time spans. Each year, the amount of pollution in a river is measured at a measuring station downstream ("pollution"). The year in which the river water was polluted upstream has already been calculated ("year_of_upstream_pollution"). My goal is to create a new column, "result_of_upstream_pollution", which contains the pollution value measured in the "year_of_upstream_pollution". For this, the data from the "pollution" column has to be reassigned.

import numpy as np
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.nan,1991,1992,1993,1994,np.nan,np.nan,2002,2002,2003,2004,2005,np.nan]
poll = [10,14,20,11,8,11,
          20,22,20,25,18,21,
          30,19,15,10,26,28]

dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)

    river_id  year  pollution  year_of_upstream_pollution
0          1  2000         10                      2002.0
1          1  2001         14                      2002.0
2          1  2002         20                      2003.0
3          1  2003         11                      2005.0
4          1  2004          8                      2005.0
5          1  2005         11                         NaN
6          2  1990         20                      1991.0
7          2  1991         22                      1992.0
8          2  1992         20                      1993.0
9          2  1993         25                      1994.0
10         2  1994         18                         NaN
11         2  1995         21                         NaN
12         3  2000         30                      2002.0
13         3  2001         19                      2002.0
14         3  2002         15                      2003.0
15         3  2003         10                      2004.0
16         3  2004         26                      2005.0
17         3  2005         28                         NaN

Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002

value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20

The resulting column should look like this:

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN

My own approach:

### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])

# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)

len(listr) # = 28

This results in the following ValueError:

"Length of values does not match length of index"

My explanation is that the values in the "year" column of dfr3 are not unique, so several pollution values are matched to each year. That would explain why len(listr) = 28 instead of 18.
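
The duplication can be reproduced on a tiny frame. The sketch below is only an illustration: the names `left`, `right`, `by_year` and `by_river_and_year` are made up for it, and the two rows are the year-2002 values of rivers 1 and 3 from the example data.

```python
import pandas as pd

# Rivers 1 and 3 both have a measurement for the year 2002.
left = pd.DataFrame({"river_id": [1, 3],
                     "year_of_upstream_pollution": [2002.0, 2002.0]})
right = pd.DataFrame({"river_id": [1, 3],
                      "year": [2002, 2002],
                      "pollution": [20, 15]})

# Merging on the year alone pairs each left row with every right row of that year.
by_year = left.merge(right, left_on="year_of_upstream_pollution",
                     right_on="year", how="left")

# Adding river_id to the keys lets each left row match at most one right row.
by_river_and_year = left.merge(right,
                               left_on=["river_id", "year_of_upstream_pollution"],
                               right_on=["river_id", "year"], how="left")

print(len(left), len(by_year), len(by_river_and_year))
```

With the year as the only key the two input rows become four, while the two-key merge keeps the row count at two.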

I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!

Quang Hoang · Accepted Answer

As you said in the title, this is a merge on two columns:

dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'], 
                                                  how='right')['pollution_x']
print(dfr1)

Output:

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN
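
An alternative way to write the same two-key lookup is a left merge against a small lookup table, which keeps the rows of dfr1 in their original order and number. This is a sketch, not part of the accepted answer: the names `lookup` and `result` are made up here, the float cast is only there to match the key dtype, and `validate="many_to_one"` is an optional safety check that raises if a (river_id, year) pair appears more than once.

```python
import numpy as np
import pandas as pd

# Example data from the question.
dfr1 = pd.DataFrame({
    "river_id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": list(range(2000, 2006)) + list(range(1990, 1996)) + list(range(2000, 2006)),
    "pollution": [10, 14, 20, 11, 8, 11,
                  20, 22, 20, 25, 18, 21,
                  30, 19, 15, 10, 26, 28],
    "year_of_upstream_pollution": [2002, 2002, 2003, 2005, 2005, np.nan,
                                   1991, 1992, 1993, 1994, np.nan, np.nan,
                                   2002, 2002, 2003, 2004, 2005, np.nan],
})

# Lookup table: one pollution value per (river_id, year) pair,
# renamed so the key columns line up with the columns to look up.
lookup = dfr1[["river_id", "year", "pollution"]].rename(
    columns={"year": "year_of_upstream_pollution",
             "pollution": "result_of_upstream_pollution"}
)
# Match the float dtype of the NaN-containing key column.
lookup["year_of_upstream_pollution"] = lookup["year_of_upstream_pollution"].astype(float)

# Left merge: every row of dfr1 is kept; rows with a NaN year get NaN as result.
result = dfr1.merge(lookup, on=["river_id", "year_of_upstream_pollution"],
                    how="left", validate="many_to_one")
print(result)
```

Because the lookup keys are unique per river and year, the merged frame has the same 18 rows as dfr1, so the new column can also be assigned back to dfr1 directly.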
