RJF
RJF

Reputation: 447

Select rows from DataFrame B based on values in rows in DataFrame A

I have two dataframes. Dataframe A is:

[distance]      [measure]
17442.77000     32.792658
17442.95100     32.792658
17517.49200     37.648482
17518.29600     37.648482
17565.77600     38.287118
17565.88800     38.287118
17596.93700     41.203340
17597.29700     41.203340
17602.16400     41.477979
17602.83900     41.612774
17618.16400     42.479890
17618.71100     42.681591

and data frame B which is:

[mileage]      [Driver]
17442.8         name1
17517.5         name2
17565.8         name3
17597.2         name4
17602.5         name5
17618.4         name6

for every [mileage] row in dataframe B, I'd like to find the two rows from [distance] in dataframe A in which A.loc[(A['distance']>= milage_value) & A['distance']<= mileage_value]so I could have something like this:

17442.77000     32.792658
17442.8         name1
17442.95100     32.792658
17517.49200     37.648482
17517.5         name2
17518.29600     37.648482
.               .
.               .

So I could apply the following function in a rolling window size of 3:

def f(x):
    return df.iloc[0,1]+(df.iloc[2,1]-df.iloc[0,1])*((df.iloc[1,0]-df.iloc[0,0])/(df.iloc[2,0]-df.iloc[0,0]))
a = df.rolling(window=3, min_periods=1).apply(f)[::3].reset_index(drop=True)

So far I have been concatenating the two Dfs and sorting values to generate triplets like above, but the problem arise when two values from df B is within a distance range in A[distance]. Any hint/advice is much appreciated!

Upvotes: 2

Views: 78

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

I think you can use the following with merge_asof using direction parameter and drop_duplicates:

df_before = pd.merge_asof(df_a, df_b, 
                 left_on='distance', 
                 right_on='mileage', 
                 direction='backward')\
              .drop_duplicates(['mileage','Driver'], keep='first')[['distance','measure']]

df_after = pd.merge_asof(df_a, df_b, 
                         left_on='distance', 
                         right_on='mileage', direction='forward')\
             .drop_duplicates(['mileage', 'Driver'], keep='last')[['distance','measure']]

df_middle = df_b.rename(columns={'Driver':'measure','mileage':'distance'})

pd.concat([df_before, df_middle, df_after]).sort_values('distance').drop_duplicates()

Output:

     distance  measure
0   17442.770  32.7927
0   17442.800    name1
1   17442.951  32.7927
2   17517.492  37.6485
1   17517.500    name2
3   17518.296  37.6485
4   17565.776  38.2871
2   17565.800    name3
5   17565.888  38.2871
6   17596.937  41.2033
3   17597.200    name4
7   17597.297  41.2033
8   17602.164   41.478
4   17602.500    name5
9   17602.839  41.6128
10  17618.164  42.4799
5   17618.400    name6
11  17618.711  42.6816

Upvotes: 1

Related Questions