nurer
nurer

Reputation: 103

How can I add a new column based on two dataframes and conditions

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is between df1['x']±2.5 and df2['y'] is between df1['y']±2.5, give 1 otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and 
    (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and 
    df['year'].iloc[i] == df2['year'].iloc[i]):
    out = 1
else:
    out = 0
       
if out == 1:
    new_coll.append(1)
else: 
    new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But the output is not correct in terms of what i want. It just compare row by row. I want to find: Is the first row in df inside df2 according to conditions? It means check all rows in df2 for each row in df. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

Upvotes: 1

Views: 113

Answers (2)

Emma
Emma

Reputation: 9308

This is the alternative solution with Pandas vectorization. If your dataframe is small, you won't get much performance burden from for loop, however, for scalability and for Pandas best practice perspective, you can take a look at the vectorization in Pandas.

Thanks to @Timus's comment, you can first merge the 2 dataframes with left on year.

dfa = df.merge(df2, on='year', how='left', suffixes=('1', '2'))

Then, apply the conditions.

dfa['Result'] = ((dfa.x2 > dfa.x1 - 2.5) & 
                (dfa.x2 < dfa.x1 + 2.5) & 
                (dfa.y2 > dfa.y1 - 2.5) & 
                (dfa.y2 < dfa.y1 + 2.5))

Finally, you group by the df's x, y, year (x1, y1, year) and return True if any row's Result is True.

# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year']).Result.any().astype(int).reset_index()

Result

      x1     y1   year  Result
0    0.0  160.1   2003       0
1   40.1  100.1   2000       1
2   50.1  110.1   2000       1
3   60.1  120.1   2001       0
4   70.1  130.1   2001       0
5   80.1  140.1   2003       0
6   90.1  150.1   2003       0
7  300.1  400.1   2004       1

Upvotes: 1

Mael_Jourdain
Mael_Jourdain

Reputation: 506

You can loop through each DataFrame and check for all combinations.

for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if  (row['x']-2.5 < row2['x']  < row['x']+2.5) and (row['y']-2.5 < row2['y']  < row['y']+2.5):
            print(index,index2)
            df.loc[index, 'Result'] = 1

Upvotes: 0

Related Questions