Reading regexes from file, in Python

Question

I am trying to read a bunch of regexes from a file, using Python.

The regexes come in a file regexes.csv, one pair per line, with the two members of each pair separated by a comma, e.g.

<\? xml([^>]*?)>,<\? XML$1>
peter,Peter

I am doing

detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]

This is not producing the right input. If I print detergent I get

[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]

It seems to be escaping the backslashes.

Moreover, in a file containing, say

a later call to re.sub(search_term, replace_term, file_content) replaces it with

<\? XML$1>

So, the $1 is not recovering the first capture group in the first regex of the pair.

What is the proper way to input regexes from a file to be later used in re.sub?

When I've had the regexes inside the script I would write them as r'...', but I am not sure what the issues are when reading them from a file.

donkopotamus · Accepted Answer

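There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays a string that contains them: printing a list shows the repr of each element, and the repr escapes every backslash. The strings themselves contain a single backslash. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it at the interactive prompt and you'll see it displayed the same way: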

>>> r"\?"
'\\?'
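
As a quick check that the doubled backslash is only a display artifact, the sketch below compares what print shows with what repr shows for the same string. The name rgx follows the example above; nothing else is assumed.

```python
rgx = r"\?"

# The string holds exactly two characters: one backslash and one question mark.
print(len(rgx))     # 2

# print() shows the characters themselves.
print(rgx)          # \?

# repr() shows a form that could be pasted back into source code,
# so the backslash is escaped.
print(repr(rgx))    # '\\?'

# Printing a list uses repr() for each element, which is why the
# list in the question shows doubled backslashes.
print([rgx])        # ['\\?']
```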

The reason your $1 is not being replaced is that $1 is not Python's syntax for group references. The correct syntax is \1 (or, equivalently, \g<1>).
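
Putting the two points together, here is a minimal sketch of reading the pairs from a file and applying them with re.sub. The helper names load_pairs and apply_pairs, the temporary file, and the sample XML declaration are illustrative assumptions, not part of the question. The replacement uses \1, and it drops the backslash before ? because re.sub leaves an unknown escape like \? in the replacement untouched, so the backslash would end up in the output.

```python
import os
import re
import tempfile

# Hypothetical file contents: one "pattern,replacement" pair per line,
# using \1 (not $1) to refer to the first capture group.
sample = r"<\?xml([^>]*?)>,<?XML\1>" + "\n" + "peter,Peter\n"


def load_pairs(path):
    """Read (pattern, replacement) pairs, splitting on the first comma."""
    pairs = []
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            search_term, replace_term = line.split(",", 1)
            pairs.append((search_term, replace_term))
    return pairs


def apply_pairs(pairs, text):
    """Apply each substitution in order."""
    for search_term, replace_term in pairs:
        text = re.sub(search_term, replace_term, text)
    return text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "regexes.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    detergent = load_pairs(path)

# Assumed sample input; the question does not show the file content.
result = apply_pairs(detergent, '<?xml version="1.0"?> peter')
print(result)  # <?XML version="1.0"?> Peter
```

Splitting on the first comma keeps the question's file format, but it breaks if a search pattern itself contains a comma; a tab or another separator that cannot occur in the patterns would be safer in that case.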
