Reputation: 1376
I'm scanning a long list of sentences to see whether each one contains a state name, and recording the matches in a dict. This is the code I came up with, but it's painfully slow. How should this be done correctly?
for sent in sentences:  # set of sentences, upper-cased
    for state in stateset:  # set of state abbreviations and names, upper-cased
        boundst = re.compile(r'\b%s\b' % state, re.I)
        if boundst.search(sent):
            sentstatedict[sent] = state
            break
I don't know how to create the bound versions ahead of time - can I create a set of them and use it?
To be clear: for each of my sentences, I wanted to find at most one matching state name or abbreviation contained in it. My difficulty was that I didn't know how to pre-assemble a list of usable "bound" versions of the state strings for whole-word matching, which is why the re.compile ended up in the inner loop.
Upvotes: 1
Views: 171
Reputation: 53565
You don't want to compile each regex over and over again (n^2 times instead of n times). It is more efficient to swap the order of the for-loops and compile each regex once, in between them:
for state in stateset:
    boundst = re.compile(r'\b%s\b' % state, re.I)  # compile the regex once (at most) per state
    for sent in sentences:
        # With the loops swapped, a break here would skip all remaining
        # sentences for this state; instead, skip sentences that already
        # have a match so each sentence keeps at most one state.
        if sent not in sentstatedict and boundst.search(sent):
            sentstatedict[sent] = state
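A minimal, self-contained sketch of this loop-order swap, using a hypothetical `stateset` and `sentences` (the real ones come from the question); the guard ensures each sentence keeps its first match while later sentences are still scanned:

```python
import re

# Hypothetical sample data for illustration only.
stateset = {"TEXAS", "OHIO"}
sentences = ["Austin is in Texas", "Columbus is in Ohio"]

sentstatedict = {}
for state in stateset:
    boundst = re.compile(r'\b%s\b' % state, re.I)  # compiled once per state
    for sent in sentences:
        # Skip sentences that already matched an earlier state.
        if sent not in sentstatedict and boundst.search(sent):
            sentstatedict[sent] = state
```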
Upvotes: 0
Reputation: 104852
I suggest that you build a single regex that matches all of your states. Then you can do a single, more complicated regex search against each sentence and extract the matched state from the result:
pattern = r"\b({})\b".format("|".join(stateset))
for sent in sentences:
    match = re.search(pattern, sent, re.I)
    if match:
        sentstatedict[sent] = match.group(1)
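For example, a runnable sketch of this combined-pattern approach with hypothetical sample data (the real `stateset` and `sentences` come from the question):

```python
import re

# Hypothetical sample data; only "TEXAS" matches the first sentence,
# since \bTX\b fails on "Texas" (the X is followed by a word character).
stateset = {"TEXAS", "TX", "OHIO"}
sentences = ["I moved to Texas last year.", "Nothing to see here."]

# One alternation of all states, wrapped in word boundaries and a capture group.
pattern = r"\b({})\b".format("|".join(stateset))

sentstatedict = {}
for sent in sentences:
    match = re.search(pattern, sent, re.I)
    if match:
        sentstatedict[sent] = match.group(1)
```

One subtlety: `match.group(1)` stores the state as it appears in the sentence ("Texas"), not the upper-cased entry from `stateset`, which differs slightly from the original code's behavior.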
I'm not bothering with re.compile because all the regex functions that take string patterns cache the compiled pattern internally, so searching with the same pattern string should be just as fast as calling compile yourself and then using the compiled pattern's methods.
Upvotes: 1
Reputation: 1524
You're compiling all the regular expressions over and over again (N times, where N is the number of sentences!). re.compile isn't a speedy operation, so that's what's causing the pain. You can build a dict of compiled patterns up front so you can look them up by state:
re_lookup = {
    state: re.compile(r'\b%s\b' % state, re.I)
    for state in stateset
}

for sent in sentences:
    for state in stateset:
        if re_lookup[state].search(sent):
            sentstatedict[sent] = state
            break
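A self-contained sketch of this precompiled-dict approach, using hypothetical sample data in place of the question's real `stateset` and `sentences`:

```python
import re

# Hypothetical sample data for illustration.
stateset = {"TEXAS", "TX", "OHIO"}
sentences = ["Flying to Ohio tomorrow", "no match in this one"]

# Compile each pattern exactly once, keyed by state.
re_lookup = {
    state: re.compile(r'\b%s\b' % state, re.I)
    for state in stateset
}

sentstatedict = {}
for sent in sentences:
    for state in stateset:
        if re_lookup[state].search(sent):
            sentstatedict[sent] = state
            break  # at most one state per sentence
```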
Upvotes: 2