Extracting rows from a pandas dataframe based on values in another dataframe

Question

I have two data frames. One dataframe (A) looks like:

Name     begin    stop    ID      
Peter     30       150    1      
Hugo     4500     6000    2      
Jennie    300      700    3

The other dataframe (B) looks like

entry     string      
89         aa      
568        bb     
938437     cc

I want to accomplish two tasks here:

I want to get a list of indices for rows (from dataframe B) for which entry column falls in the interval (specified by begin and stop column) in dataframe A. The result for this task will be:

lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.

The indices that I get from task 1, I want to remove it from dataframe B to create a new dataframe. Thus, the new dataframe will look like:

entry     string          
938437     cc

How can I accomplish these two tasks?

Ben.T · Accepted Answer

You can use merge_asof

l = (pd.merge_asof(dfB['entry'].reset_index() #to keep original index after merge
                      .sort_values('entry'), #mandatory to use this merge_asof
                   dfA[['begin','stop']].sort_values('begin'),
                   left_on='entry', right_on='begin',
                   direction='backward') # begin lower than entry
       .query('stop >= entry') # keep only where entry lower than stop
       ['index'].tolist()
    )
print(l)
# Int64Index([0, 1], dtype='int64')

new_df = dfB.loc[dfB.index.difference(l)]
print(new_df)
#     entry string
# 2  938437     cc

Now if you don't need the list onf index and that you real goal is the new_df, then you can do directly

new_df = (pd.merge_asof(dfB.sort_values('entry'), 
                        dfA[['begin','stop']].sort_values('begin'),
                        left_on='entry', right_on='begin',
                        direction='backward')
            .query('stop < entry') #here different inegality
            .drop(['begin','stop'], axis=1) #clean the result
            .reset_index(drop=True)
         )
print(new_df)

Extracting rows from a pandas dataframe based on values in another dataframe

Answers (2)

Related Questions