royskatt
royskatt

Reputation: 1210

List of compiled regexes in Python

I have a lot of substitution patterns which I need for text cleaning. I load the data from a database and compile the regular expressions before for performance reasons. Unfortunately with my approach only the last assignment of the variable "text" seems to be valid, while the others appear to be overwritten:

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_199',str(row[0]))
        print text

extract_from_db()

Does anyone know how to solve this in a working, elegant way? Or do I have to pound this through huge if/elif control structure?

Upvotes: 1

Views: 219

Answers (4)

Kirk Strauser
Kirk Strauser

Reputation: 30947

Might I suggest building a list of patterns and their replacement values, then iterating across it? Then you don't have to modify the function every time you want to update the patterns:

import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
# ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]

def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)

        print text

extract_from_db()

Upvotes: 1

Martijn Pieters
Martijn Pieters

Reputation: 1123400

You keep replacing the last result with a replacement on str(row[0]). Use text instead to accumulate substitutions:

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_1.sub(r'REPLACE_1', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_199', text)

You'd be better of using an actual list instead:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

and use those in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)

or using functools.partial() to simplify the loop a bit further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

and the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)

or using the above list of patterns wrapped in partial() objects, and reduce():

text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0])

Upvotes: 7

timrau
timrau

Reputation: 23058

It looks like what you need is:

    text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
    text = REPLACE_2.sub(r'REPLACE_1',text)
    # ..
    text = REPLACE_99.sub(r'REPLACE_99',text)
    text = REPLACE_100.sub(r'REPLACE_199',text)

Upvotes: 1

Corley Brigman
Corley Brigman

Reputation: 12401

Your approach is fine; however, on every line, you are applying the regex to the original string. You need to apply it to the result of the previous line, i.e.:

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text

Upvotes: 1

Related Questions