Optimization - matching sequence of words in dataframe column

Question

I have a list of word sequences, and I'm trying to determine whether a string column contains any of the sequences in the list. If there is a match, a new column should contain 1; otherwise it should contain 0.
The code below achieves that, but it does not scale well to large data.

import numpy as np
import pandas as pd
import re

data = {'TextVar' : ['this should never match',
'matches foo bar',
'this is the second random pattern',
np.nan,
'foo bars, should return 0',
'foo bar, with a comma, should return 1']}

df = pd.DataFrame(data)
patterns = ['foo bar', 'second random pattern', 'pink unicorns',]

def stringFound(string1, string2):
    """
    string1 = pattern to look for
    string2 = string to look in
    """
    if pd.isnull(string1) or pd.isnull(string2):
        return False
    if re.search(r"\b" + re.escape(string1) + r"\b", string2):
        return True
    return False

def hasPattern(pattern_list, text):
    for e in pattern_list:
        if stringFound(e, text):
            return 1
    return 0

df['Output'] = df.apply(lambda x: hasPattern(patterns, x['TextVar']), axis=1)

I tried running this with a list of 5000 sequences (len(patterns) == 5000) and 15000 rows in the dataframe, and after 30 minutes it was still running. I realize that I'm potentially iterating 75 million times (5000 patterns x 15000 rows). How could I write this to be more time-efficient?

MaxU - stand with Ukraine · Accepted Answer

In [16]: pat = '|'.join([r'\b{}\b'.format(x) for x in patterns])

In [17]: pat
Out[17]: '\\bfoo bar\\b|\\bsecond random pattern\\b|\\bpink unicorns\\b'

In [18]: df['TextVar'].fillna('').str.contains(pat).astype(np.int8)
Out[18]:
0    0
1    1
2    1
3    0
4    0
5    1
Name: TextVar, dtype: int8
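
The question's code escapes each pattern with re.escape, while the joined pattern above does not. A minimal sketch that combines the two ideas, assuming the patterns are literal phrases: escape each phrase, wrap the alternation in a single non-capturing group so the word boundaries are written only once, and let na=False handle the missing value instead of fillna(''). The variable name escaped is only illustrative.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'TextVar': ['this should never match',
                               'matches foo bar',
                               'this is the second random pattern',
                               np.nan,
                               'foo bars, should return 0',
                               'foo bar, with a comma, should return 1']})
patterns = ['foo bar', 'second random pattern', 'pink unicorns']

# Escape each phrase so that regex metacharacters are matched literally,
# then put the whole alternation inside one non-capturing group.
escaped = '|'.join(re.escape(p) for p in patterns)
pat = r'\b(?:{})\b'.format(escaped)

# One regex search per row; na=False treats missing values as "no match".
df['Output'] = df['TextVar'].str.contains(pat, regex=True, na=False).astype(np.int8)
print(df)
```

Note that \b only matches between a word character and a non-word character, so phrases that start or end with punctuation may need a different kind of boundary check.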

PS: for more complex patterns, try the pattern suggested by @Wiktor Stribiżew:

pat = r'(?
