Reputation: 192
I have a use case where I need to validate each row in the df and mark if it is correct or not. Validation rules are in another df.
Main
col1 col2
0 1 take me home
1 2 country roads
2 2 country roads take
3 4 me home
Rules
col3 col4
0 1 take
1 2 home
2 3 country
3 4 om
4 2 take
A row in main
is marked as pass if the following condition matches for any row in rules
The condition for passing is: col1==col3 and col4 is substring of col2
Main
col1 col2 result
0 1 take me home Pass
1 2 country roads Fail
2 2 country roads take Pass
3 4 me home Pass
My initial approach was to parse Rules df and create a function out of it dynamically and then run
def action_function(row) -> object:
if self.combined_filter()(row): #combined_filter() is the lambda equivalent of Rules df
return success_action(row) #mark as pass
return fail_action(row) #mark as fail
Main["result"] = self.df.apply(action_function, axis=1)
This turned out to be very slow as apply is not vectorized. The main df is about 3 million and Rules df is around 500 entries. Time taken is around 3 hour.
I am trying to use pandas merge for this. But substring match is not supported by the merge operation. I cannot split words by space or anything.
This will be used as part of a system. So I cannot hardcode anything. I need to read the df from excel every time system starts. Can you please suggest an approach for this?
Upvotes: 1
Views: 107
Reputation: 30605
Merge and then apply the condtion using np.where i.e
temp = main.merge(rules,left_on='col1',right_on='col3')
temp['results'] = temp.apply(lambda x : np.where(x['col4'] in x['col2'],'Pass','Fail'),1)
no_dupe_df = temp.drop_duplicates('col2',keep='last').drop(['col3','col4'],1)
col1 col2 results
0 1 take me home Pass
2 2 country roads Fail
4 2 country roads take Pass
5 4 me home Pass
Upvotes: 1