Reputation: 1325
I have a list of word sequences, and I'm trying to determine whether a string column contains any of the sequences in the list. If there is a match, the new column should contain 1; otherwise it should contain 0.
The code below achieves that, but it does not scale well to large data.
import numpy as np
import pandas as pd
import re
data = {'TextVar': ['this should never match',
                    'matches foo bar',
                    'this is the second random pattern',
                    np.nan,
                    'foo bars, should return 0',
                    'foo bar, with a comma, should return 1']}
df = pd.DataFrame(data)
patterns = ['foo bar', 'second random pattern', 'pink unicorns',]
def stringFound(string1, string2):
    """
    string1 = pattern to look for
    string2 = string to look in
    """
    if pd.isnull(string1) or pd.isnull(string2):
        return False
    if re.search(r"\b" + re.escape(string1) + r"\b", string2):
        return True
    return False
def hasPattern(pattern_list, text):
    for e in pattern_list:
        if stringFound(e, text):
            return 1
    return 0
df['Output'] = df.apply(lambda x: hasPattern(patterns, x['TextVar']), axis=1)
I tried running this on a list of 5000 sequences (len(patterns) = 5000) and a dataframe with 15000 rows, and after 30 minutes it is still running. I realize that I'm potentially iterating 75 million times. How could I write this to be more time-efficient?
Upvotes: 1
Views: 228
Reputation: 210912
In [16]: pat = '|'.join([r'\b{}\b'.format(x) for x in patterns])
In [17]: pat
Out[17]: '\\bfoo bar\\b|\\bsecond random pattern\\b|\\bpink unicorns\\b'
In [18]: df['TextVar'].fillna('').str.contains(pat).astype(np.int8)
Out[18]:
0 0
1 1
2 1
3 0
4 0
5 1
Name: TextVar, dtype: int8
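PS: if the patterns are more complex (for example, if they contain regex metacharacters), try a pattern from @Wiktor Stribiżew, with the alternation wrapped in a non-capturing group so that both lookarounds apply to every alternative:
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join([re.escape(m) for m in patterns]))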
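For reference, here is a minimal end-to-end sketch of the escaped variant on the sample data from the question. It assumes the alternation is wrapped in a non-capturing group `(?:...)`, so that the lookbehind and lookahead guard every alternative rather than only the first and last one; the intermediate name `alternation` is introduced here purely for illustration.

```python
import re

import numpy as np
import pandas as pd

# Sample data and patterns copied from the question.
df = pd.DataFrame({'TextVar': ['this should never match',
                               'matches foo bar',
                               'this is the second random pattern',
                               np.nan,
                               'foo bars, should return 0',
                               'foo bar, with a comma, should return 1']})
patterns = ['foo bar', 'second random pattern', 'pink unicorns']

# Escape each pattern so regex metacharacters are matched literally,
# then group the alternation so both lookarounds apply to every alternative.
alternation = '|'.join(re.escape(p) for p in patterns)
pat = r'(?<!\w)(?:{})(?!\w)'.format(alternation)

# One vectorized regex search per row instead of one per (row, pattern) pair.
df['Output'] = (df['TextVar']
                .fillna('')
                .str.contains(pat, regex=True)
                .astype(np.int8))

print(df['Output'].tolist())  # [0, 1, 1, 0, 0, 1]
```

Without the group, `'foo bars, should return 0'` would be flagged as a match, because the first alternative would carry only the lookbehind and not the trailing `(?!\w)` check.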
Upvotes: 1