Reputation: 1210
I have a lot of substitution patterns which I need for text cleaning. I load the data from a database and compile the regular expressions before for performance reasons. Unfortunately with my approach only the last assignment of the variable "text" seems to be valid, while the others appear to be overwritten:
# -*- coding: utf-8 -*-
import cx_Oracle
import re
connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")
# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')
def extract_from_db():
text = ''
for row in cursor:
# sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
# ..
text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
text = REPLACE_100.sub(r'REPLACE_199',str(row[0]))
print text
extract_from_db()
Does anyone know how to solve this in a working, elegant way? Or do I have to pound this through huge if/elif control structure?
Upvotes: 1
Views: 219
Reputation: 30947
Might I suggest building a list of patterns and their replacement values, then iterating across it? Then you don't have to modify the function every time you want to update the patterns:
import cx_Oracle
import re
connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")
REPLACEMENTS = [
(re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
(re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
# ..
(re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
(re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]
def extract_from_db():
for row in cursor:
text = str(row[0])
for pattern, replacement in REPLACEMENTS:
text = pattern.sub(replacement, text)
print text
extract_from_db()
Upvotes: 1
Reputation: 1123400
You keep replacing the last result with a replacement on str(row[0])
. Use text
instead to accumulate substitutions:
text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_1.sub(r'REPLACE_1', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_199', text)
You'd be better of using an actual list instead:
REPLACEMENTS = [
(re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
(re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
# ..
(re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
(re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]
and use those in a loop:
text = str(row[0])
for pattern, replacement in REPLACEMENTS:
text = pattern.sub(replacement, text)
or using functools.partial()
to simplify the loop a bit further:
from functools import partial
REPLACEMENTS = [
partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
# ..
partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]
and the loop:
text = str(row[0])
for replacement in REPLACEMENTS:
text = replacement(text)
or using the above list of patterns wrapped in partial()
objects, and reduce()
:
text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0])
Upvotes: 7
Reputation: 23058
It looks like what you need is:
text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
text = REPLACE_2.sub(r'REPLACE_1',text)
# ..
text = REPLACE_99.sub(r'REPLACE_99',text)
text = REPLACE_100.sub(r'REPLACE_199',text)
Upvotes: 1
Reputation: 12401
Your approach is fine; however, on every line, you are applying the regex to the original string. You need to apply it to the result of the previous line, i.e.:
def extract_from_db():
text = ''
for row in cursor:
# sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
# This one stays the same - initialize from the row
text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
# For these, route text back into it
text = REPLACE_2.sub(r'REPLACE_2',text)
# ..
text = REPLACE_99.sub(r'REPLACE_99',text)
text = REPLACE_100.sub(r'REPLACE_100',text)
print text
Upvotes: 1