Nithin Mohan
Nithin Mohan

Reputation: 192

Pandas substring search for filter

I have a use case where I need to validate each row in the df and mark if it is correct or not. Validation rules are in another df.

Main

    col1 col2
0    1    take me home
1    2    country roads
2    2    country roads take
3    4    me home


Rules

   col3 col4
0    1    take
1    2    home
2    3    country
3    4    om
4    2    take

A row in main is marked as pass if the following condition matches for any row in rules

The condition for passing is: col1==col3 and col4 is substring of col2

 Main

     col1 col2               result
 0    1    take me home       Pass
 1    2    country roads      Fail
 2    2    country roads take Pass
 3    4    me home            Pass

My initial approach was to parse Rules df and create a function out of it dynamically and then run

    def action_function(row) -> object:
        if self.combined_filter()(row): #combined_filter() is the lambda equivalent of Rules df
            return success_action(row) #mark as pass
        return fail_action(row) #mark as fail

    Main["result"] = self.df.apply(action_function, axis=1)

This turned out to be very slow as apply is not vectorized. The main df is about 3 million and Rules df is around 500 entries. Time taken is around 3 hour.

I am trying to use pandas merge for this. But substring match is not supported by the merge operation. I cannot split words by space or anything.

This will be used as part of a system. So I cannot hardcode anything. I need to read the df from excel every time system starts. Can you please suggest an approach for this?

Upvotes: 1

Views: 107

Answers (1)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

Merge and then apply the condtion using np.where i.e

temp = main.merge(rules,left_on='col1',right_on='col3')
temp['results'] = temp.apply(lambda x : np.where(x['col4'] in x['col2'],'Pass','Fail'),1)

no_dupe_df = temp.drop_duplicates('col2',keep='last').drop(['col3','col4'],1)

   col1                col2 results
0     1        take me home    Pass
2     2       country roads    Fail
4     2  country roads take    Pass
5     4             me home    Pass

Upvotes: 1

Related Questions