user1873789
user1873789

Reputation: 43

Faster way to index pandas dataframe multiple times

For every row in df_a, I am looking to find rows in df_b where the id's are the same and the df_a row's location falls within the df_b row's start and end location.

df_a looks like:

|---------------------|------------------|------------------|
|      Name           |     id           |     location     |
|---------------------|------------------|------------------|
|          a          |         1        |       202013     |
|---------------------|------------------|------------------|

df_b looks like:

|---------------------|------------------|------------------|------------------|
|      Name           |     id           |  location_start  |  location_end    |
|---------------------|------------------|------------------|------------------|
|          x          |         1        |       202010     |       2020199    |
|---------------------|------------------|------------------|------------------|

Unfortunately, df_a and df_b are both nearly a million rows. This code is taking like 10 hours to run on my local. Currently I'm running the following:

for index,row in df_a.iterrows():
         matched = df_b[(df_b['location_start']<row['location']) 
                  & (df_b['location_end']>row['location']) 
                  & (df_b['id']==row['id'])]

Is there any obvious way to speed this up?

Upvotes: 1

Views: 127

Answers (1)

Mayank Porwal
Mayank Porwal

Reputation: 34046

You can do this:

Consider my sample dataframes below:

In [90]: df_a = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location':[202013, 102013]})                                                                                                                   

In [91]: df_b = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location_start':[202010, 1020199],'location_end':[2020199, 1020299] })

In [92]: df_a                                                                                                                                                                                               
Out[92]: 
  Name  id  location
0    a   1    202013
1    b   2    102013

In [93]: df_b                                                                                                                                                                                               
Out[93]: 
  Name  id  location_start  location_end
0    a   1          202010       2020199
1    b   2         1020199       1020299

In [95]: d = pd.merge(df_a, df_b, on='id')

In [106]: indexes = d[d['location'].between(d['location_start'], d['location_end'])].index.tolist()                                                                                                         

In [107]: df_b.iloc[indexes, :]                                                                                                                                                                             
Out[107]: 
  Name  id  location_start  location_end
0    a   1          202010       2020199

Upvotes: 1

Related Questions