Pandas substring search for filter

Question

I have a use case where I need to validate each row in the df and mark if it is correct or not. Validation rules are in another df.

Main

    col1 col2
0    1    take me home
1    2    country roads
2    2    country roads take
3    4    me home


Rules

   col3 col4
0    1    take
1    2    home
2    3    country
3    4    om
4    2    take

A row in main is marked as pass if the following condition matches for any row in rules

The condition for passing is: col1==col3 and col4 is substring of col2

 Main

     col1 col2               result
 0    1    take me home       Pass
 1    2    country roads      Fail
 2    2    country roads take Pass
 3    4    me home            Pass

My initial approach was to parse Rules df and create a function out of it dynamically and then run

    def action_function(row) -> object:
        if self.combined_filter()(row): #combined_filter() is the lambda equivalent of Rules df
            return success_action(row) #mark as pass
        return fail_action(row) #mark as fail

    Main["result"] = self.df.apply(action_function, axis=1)

This turned out to be very slow as apply is not vectorized. The main df is about 3 million and Rules df is around 500 entries. Time taken is around 3 hour.

I am trying to use pandas merge for this. But substring match is not supported by the merge operation. I cannot split words by space or anything.

This will be used as part of a system. So I cannot hardcode anything. I need to read the df from excel every time system starts. Can you please suggest an approach for this?

Pandas substring search for filter

Answers (1)

Related Questions