Reputation: 5274
I have a rather simple set of requirements. I have a list (of length 2 million) of objects, each with 2 properties that need to be regexed (other properties are not changed).
Values of ZERO ONE TWO ... TEN need to be changed to their numeric value: 0 1 2 ... 10
Examples:
ONE MAIN STREET -> 1 MAIN STREET
BONE ROAD -> BONE ROAD
BUILDING TWO, THREE MAIN ROAD -> BUILDING 2, 3 MAIN ROAD
ELEVEN MAIN ST -> ELEVEN MAIN STREET
ONE HUNDRED FUNTOWN -> 1 HUNDRED FUNTOWN
Clearly there are some numbers that do not get changed and some that get changed oddly. That is completely expected.
I can get it all to work with what I have below. My question is, is there a clever way to make this all run faster? I've thought of making a list of dictionaries where the keys are the word-numbers and the values are numeric, but I don't think that will help with performance. Or should I re.compile each regex and pass them into this function? Any clever ideas out there to make this run faster?
import re

def update_word_to_numeric(entrylist):
    updated_entrylist = []
    for theentry in entrylist:
        theentry.addr_ln_1 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bONE\b", "1", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_1)
        theentry.addr_ln_2 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bONE\b", "1", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_2)
        updated_entrylist.append(theentry)
    return updated_entrylist
Maybe this is just a fine way to do it. Comments of "that's good enough" are good with me too :)
Upvotes: 4
Views: 86
Reputation: 47292
Here's an approach using a dictionary:
import re

s = '''
ONE MAIN STREET
BONE ROAD
BUILDING TWO, THREE MAIN ROAD
ELEVEN MAIN ST
ONE HUNDRED FUNTOWN
'''
# Map each number word to its digit string
d = {'ZERO':'0', 'ONE':'1', 'TWO':'2', 'THREE':'3', 'FOUR':'4',
     'FIVE':'5', 'SIX':'6', 'SEVEN':'7', 'EIGHT':'8', 'NINE':'9',
     'TEN':'10', 'ELEVEN':'11', 'TWELVE':'12'}
# One alternation of all keys, anchored on word boundaries
p = re.compile(r'\b(' + '|'.join(d.keys()) + r')\b')
# The lambda looks up the replacement for whichever word matched
r = p.sub(lambda x: d[x.group()], s)
print(r)
Add or remove entries from the dictionary as you see fit.
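If the dictionary is going to be edited over time, it may be worth wrapping the pattern construction so it is rebuilt from whatever keys are present. The following is only a minimal sketch: the names WORD_TO_DIGIT, PATTERN and words_to_digits are made up for illustration, and re.escape is added on the assumption that a future key might contain a character that is special in a regex.

```python
import re

# Hypothetical mapping; ZERO..TEN as in the question
WORD_TO_DIGIT = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
    "TEN": "10",
}

# Build the alternation once from the keys; re.escape guards against
# keys that contain regex metacharacters
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in WORD_TO_DIGIT) + r")\b"
)

def words_to_digits(text):
    """Replace each whole-word key in text with its mapped digit string."""
    return PATTERN.sub(lambda m: WORD_TO_DIGIT[m.group()], text)
```

Calling words_to_digits on each address line then only needs the one precompiled pattern.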
Upvotes: 3
Reputation: 27331
It's much faster to use one regular expression instead of ten (I noticed a 3x increase in speed):
import re

def replace(match):
    # Look up the digit string for the word captured by group 1
    return {
        "ZERO": "0",
        "ONE": "1",
        "TWO": "2",
        "THREE": "3",
        "FOUR": "4",
        "FIVE": "5",
        "SIX": "6",
        "SEVEN": "7",
        "EIGHT": "8",
        "NINE": "9",
        "TEN": "10",
    }[match.group(1)]

# Compiled once at import time and reused for every entry
pattern = re.compile(r"\b(ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\b")

def update_word_to_numeric(entrylist):
    updated_entrylist = []
    for theentry in entrylist:
        theentry.addr_ln_1 = pattern.sub(replace, theentry.addr_ln_1)
        theentry.addr_ln_2 = pattern.sub(replace, theentry.addr_ln_2)
        updated_entrylist.append(theentry)
    return updated_entrylist
I'm using the little-known ability to hand re.sub a function as the second argument: it takes a match object and returns the replacement string. That way we can look up the replacement for whichever word matched.
I also used re.compile to precompile the regex. This improved the time as well, but not as much as the big change.
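If you want to check the difference on your own machine, something along these lines might do as a rough harness. It is a sketch only: the sample strings are the ones from the question, the helper names slow, fast and compare are mine, and timings will vary, so no numbers are claimed here.

```python
import re
import timeit

WORDS = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE",
         "SIX", "SEVEN", "EIGHT", "NINE", "TEN"]
DIGITS = {word: str(i) for i, word in enumerate(WORDS)}
COMBINED = re.compile(r"\b(" + "|".join(WORDS) + r")\b")

SAMPLES = [
    "ONE MAIN STREET",
    "BONE ROAD",
    "BUILDING TWO, THREE MAIN ROAD",
    "ELEVEN MAIN ST",
    "ONE HUNDRED FUNTOWN",
]

def slow(text):
    # One re.sub call per word, as in the question
    for word, digit in DIGITS.items():
        text = re.sub(r"\b" + word + r"\b", digit, text)
    return text

def fast(text):
    # A single pass with the combined, precompiled pattern
    return COMBINED.sub(lambda m: DIGITS[m.group(1)], text)

# Sanity check: both approaches should give identical output
results_match = all(slow(s) == fast(s) for s in SAMPLES)

def compare(number=10000):
    """Return (slow_seconds, fast_seconds) for `number` passes over SAMPLES."""
    t_slow = timeit.timeit(lambda: [slow(s) for s in SAMPLES], number=number)
    t_fast = timeit.timeit(lambda: [fast(s) for s in SAMPLES], number=number)
    return t_slow, t_fast
```

Running compare() and dividing the first value by the second gives the speed-up factor for these samples; real address data may behave differently.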
Upvotes: 5
Reputation: 7412
import re

# The list index doubles as the numeric value; \b is added in a raw string below
numbers = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT", "NINE", "TEN"]
for theentry in entrylist:
    for i, number in enumerate(numbers):
        word_pattern = r"\b{}\b".format(number)
        theentry.addr_ln_1 = re.sub(word_pattern, str(i), theentry.addr_ln_1)
        theentry.addr_ln_2 = re.sub(word_pattern, str(i), theentry.addr_ln_2)
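One detail to keep in mind when patterns are stored in a list like this: in an ordinary (non-raw) string literal, "\b" is the backspace character rather than a regex word boundary, and passing such a string through r"{}".format(...) does not change the characters it already contains. A small check of the difference, using only the standard library (the variable names are just for illustration):

```python
import re

plain = "\bONE\b"   # "\b" is a single backspace character (\x08) here
raw = r"\bONE\b"    # raw string keeps backslash + "b", a regex word boundary

# The plain version looks for literal backspaces around ONE, so nothing matches
plain_result = re.sub(plain, "1", "ONE MAIN STREET")

# The raw version matches ONE as a whole word
raw_result = re.sub(raw, "1", "ONE MAIN STREET")
```

Here plain_result is left as the original string, while raw_result has the word replaced.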
Upvotes: 1