Jato
Jato

Reputation: 61

Python: Merge on 2 columns

I'm working with a large dataset. The following is an example, calculated with a smaller dataset.

In this example i got the measurements of the pollution of 3 rivers for different timespans. Each year, the amount pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated, in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal ist to create a new column ["result_of_upstream_pollution"], which contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution"-column has to be reassigned.

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.NaN,1991,1992,1993,1994,np.NaN,np.NaN,2012,2012,2013,2014,2015,np.NaN]
poll = [10,14,20,11,8,11,
          20,22,20,25,18,21,
          30,19,15,10,26,28]

dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)

    river_id  year  pollution  year_of_upstream_pollution
0          1  2000         10                      2002.0
1          1  2001         14                      2002.0
2          1  2002         20                      2003.0
3          1  2003         11                      2005.0
4          1  2004          8                      2005.0
5          1  2005         11                         NaN
6          2  1990         20                      1991.0
7          2  1991         22                      1992.0
8          2  1992         20                      1993.0
9          2  1993         25                      1994.0
10         2  1994         18                         NaN
11         2  1995         21                         NaN
12         3  2000         30                      2002.0
13         3  2001         19                      2002.0
14         3  2002         15                      2003.0
15         3  2003         10                      2004.0
16         3  2004         26                      2005.0
17         3  2005         28                         NaN

Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002

The resulting column should look like this:

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN  

My own approach:

### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])

# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)

len(listr) # = 28

This results in the following ValueError:

I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!

Upvotes: 1

Views: 1955

Answers (2)

Jato
Jato

Reputation: 61

I just realized that this solution doesn't seem to be working for me. When i execute the code, this is what happens:

dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'], 
                                                  how='right')['pollution_x']
print(dfr1)
    river_id  year  pollution  year_of_upstream_pollution  \
0          1  2000         10                      2002.0   
1          1  2001         14                      2002.0   
2          1  2002         20                      2003.0   
3          1  2003         11                      2005.0   
4          1  2004          8                      2005.0   
5          1  2005         11                         NaN   
6          2  1990         20                      1991.0   
7          2  1991         22                      1992.0   
8          2  1992         20                      1993.0   
9          2  1993         25                      1994.0   
10         2  1994         18                         NaN   
11         2  1995         21                         NaN   
12         3  2000         30                      2002.0   
13         3  2001         19                      2002.0   
14         3  2002         15                      2003.0   
15         3  2003         10                      2004.0   
16         3  2004         26                      2005.0   
17         3  2005         28                         NaN   

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                           22.0  
6                           20.0  
7                           25.0  
8                           18.0  
9                           15.0  
10                          15.0  
11                          10.0  
12                          26.0  
13                          28.0  
14                           NaN  
15                           NaN  
16                           NaN  
17                           NaN  

For some reason, this code doesn't seem to be handling the "NaN" values in the right way.

  • If there is an "NaN"-value (in the column: "year_of_upstream_pollution"), there shouldnt be a value in "result_of_upstream_pollution".
  • Equally, the ids 14,15 and 16 all have values for the "year_of_upstream_pollution" which has matching data in the "pollution-column" and therefore should also have values in the result-column.
  • On top of that, it seems that all values after the first "NaN" (at id = 5) are assigned the wrong values.

@Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?

Does anyone know how i can get this code to work?

Upvotes: 0

Quang Hoang
Quang Hoang

Reputation: 150805

As you said in the title, this is merge on two column:

dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'], 
                                                  how='right')['pollution_x']
print(df)

Output:

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN  

Upvotes: 1

Related Questions