Xodarap777

Reputation: 1376

How to efficiently loop regex searches from set python

I'm evaluating a long list of sentences to see if they include state names and map them with a dict, and this is the code I came up with -- it's painfully slow. How should this be done correctly?

for sent in sentences:  # set of sentences, already .upper()
    for state in stateset:  # set of state abbreviations and names, in .upper()
        boundst = re.compile(r'\b%s\b' % state, re.I)
        if re.search(boundst, sent):
            sentstatedict[sent] = state
            break

I don't know how to create the bound versions ahead of time - can I create a set of them and use it?

To be clear, I wanted to find out, for each sentence I had, at most one matching state name or abbreviation contained in that sentence. My difficulty was in not knowing how to pre-assemble a list of usable "bound" versions of the state strings for "whole word" matching. That led me to having the re.compile in the inner loop.

Upvotes: 1

Views: 171

Answers (3)

Nir Alfasi

Reputation: 53565

You don't want to compile the regexes over and over again (n^2 compilations instead of n...). It will be more efficient to swap the order of the for-loops and compile each regex once, in between:

for state in stateset:
    boundst = re.compile(r'\b%s\b' % state, re.I) # compile the regex once (at most) per state
    for sent in sentences:
        if re.search(boundst, sent):
            sentstatedict[sent] = state
            # note: no `break` here - with the loops swapped, breaking would
            # stop after the first sentence that mentions this state

Upvotes: 0

Blckknght

Reputation: 104852

I suggest that you build a single regex that matches all of your states. Then you can do a single, more complicated regex search against each sentence and extract the matched state from the result:

pattern = r"\b({})\b".format("|".join(stateset))
for sent in sentences:
    match = re.search(pattern, sent, re.I)
    if match:
        sentstatedict[sent] = match.group(1)

I'm not bothering with re.compile because all the regex methods that take string patterns will cache the compiled pattern internally. So searching with the same pattern string should be just as fast as calling compile yourself and then using the compiled pattern's methods.
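A minimal runnable sketch of this approach, using a hypothetical four-entry stateset (the real one would hold every state name and abbreviation):

```python
import re

# hypothetical stateset for illustration; the real one is much larger
stateset = {"NY", "NEW YORK", "CA", "CALIFORNIA"}

# one alternation pattern matching any state as a whole word
pattern = r"\b({})\b".format("|".join(stateset))

sentences = ["I MOVED TO NEW YORK LAST YEAR", "NOTHING TO SEE HERE"]
sentstatedict = {}
for sent in sentences:
    match = re.search(pattern, sent, re.I)
    if match:
        sentstatedict[sent] = match.group(1)

print(sentstatedict)  # {'I MOVED TO NEW YORK LAST YEAR': 'NEW YORK'}
```

Note that if any state strings contained regex metacharacters you would want to run them through `re.escape` before joining; plain names and abbreviations are safe as-is.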

Upvotes: 1

BrianO

Reputation: 1524

You're compiling all the regular expressions over and over again (N times, where N is the number of sentences!). re.compile isn't a speedy operation, so that's what's causing the pain. You can build a dict of compiled patterns up front and look them up by state:

re_lookup = {
    state: re.compile(r'\b%s\b' % state, re.I)
    for state in stateset
}

for sent in sentences:
    for state in stateset:
        if re_lookup[state].search(sent):
            sentstatedict[sent] = state
            break
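Put end to end, a minimal self-contained version of this (with a hypothetical four-entry stateset standing in for the real one) looks like:

```python
import re

# hypothetical stateset for illustration; the real one holds all states
stateset = {"NY", "NEW YORK", "CA", "CALIFORNIA"}

# compile each whole-word pattern exactly once, up front
re_lookup = {
    state: re.compile(r'\b%s\b' % state, re.I)
    for state in stateset
}

sentences = ["SHE FLEW TO CALIFORNIA", "NOTHING TO SEE HERE"]
sentstatedict = {}
for sent in sentences:
    for state in stateset:
        if re_lookup[state].search(sent):
            sentstatedict[sent] = state
            break  # at most one state per sentence, as in the question

print(sentstatedict)  # {'SHE FLEW TO CALIFORNIA': 'CALIFORNIA'}
```

The word boundaries matter here: `\bCA\b` does not match inside "CALIFORNIA", so only the full state name is recorded.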

Upvotes: 2
