myfirsttime1

Reputation: 287

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using Python.

The regexes come in a file regexes.csv, one pair per line, with the two parts of each pair separated by a comma, e.g.

<\? xml([^>]*?)>,<\? XML$1>
peter,Peter

I am doing

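detergent = []
# The with-block closes the file automatically
with open('regexes.csv', 'r') as infile:
    for line in infile:
        line = line.strip()
        # Split on the first comma only, so the replacement may contain commas
        search_term, replace_term = line.split(',', 1)
        detergent.append([search_term, replace_term])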

This is not producing the right result. If I print detergent, I get

[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]

It seems to be escaping the backslashes.

Moreover, in a file containing, say

<? xml ........>

a call to re.sub(search_term, replace_term, file_content) further down in the script replaces it with

<\? XML$1>

So, the $1 is not recovering the first capture group in the first regex of the pair.

What is the proper way to input regexes from a file to be later used in re.sub?

When I've had the regexes inside the script, I would write them as r'...' raw strings, but I am not sure what the issues are when reading from a file.

Upvotes: 1

Views: 54

Answers (1)

donkopotamus

Reputation: 23176

There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays a string containing a backslash when it shows the string's repr, which is what happens when you print a list. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it at the interactive prompt and you'll see it displayed the same way:

>>> r"\?"
'\\?'
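
To make the difference between the displayed form and the actual contents concrete, here is a small sketch (the variable names are only for illustration). It compares repr(), which the interactive prompt and list printing use, with the string itself:

```python
rgx = r"\?"

# repr() is what the interactive prompt and list printing show:
# the single backslash is displayed doubled, inside quotes.
shown_by_repr = repr(rgx)

print(shown_by_repr)  # '\\?'
print(rgx)            # \?
print(len(rgx))       # 2 (one backslash plus the question mark)
```

So the string read from the file holds exactly one backslash; only its display inside the printed list is doubled.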

Your $1 is not being replaced because $1 is not the syntax Python's re module uses for group references in a replacement string. The correct syntax is \1 (or, equivalently, \g<1>).
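
As a rough sketch of the whole round trip, assuming the file's replacement column is rewritten to use \1: the example below uses io.StringIO as a stand-in for regexes.csv, and the file contents and sample text are made up for illustration. Note that the replacement side does not escape the question mark, since re.sub leaves an unknown escape such as \? in the output as-is.

```python
import io
import re

# Stand-in for regexes.csv (hypothetical contents).
# The replacement column uses \1 instead of $1.
fake_file = io.StringIO(
    r"<\? xml([^>]*?)>,<? XML\1>" + "\n"
    r"peter,Peter" + "\n"
)

pairs = []
for line in fake_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Split on the first comma only
    search_term, replace_term = line.split(",", 1)
    pairs.append((search_term, replace_term))

# Made-up sample text to run the substitutions on
file_content = '<? xml version="1.0"> hello peter'
for search_term, replace_term in pairs:
    file_content = re.sub(search_term, replace_term, file_content)

print(file_content)  # <? XML version="1.0"> hello Peter
```

Splitting on the first comma means a search pattern that itself contains a comma would be cut in the wrong place, so a different separator (or the csv module) may be a better fit for such patterns.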

Upvotes: 2
