Reputation: 287
I am trying to read a bunch of regexes from a file, using Python.
The regexes come in a file regexes.csv, one pair per line, with the two parts of the pair separated by a comma, e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term, replace_term]]
This is not producing the right input. If I print detergent, I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]
It seems to be escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
the call re.sub(search_term, replace_term, file_content) further down in the script replaces it with
<\? XML$1>
So the $1 is not recovering the first capture group from the first regex of the pair.
What is the proper way to read regexes from a file so they can later be used in re.sub?
When I've had the regexes inside the script I would write them as r'...', but I am not sure what the issues are when reading from a file.
Upvotes: 1
Views: 54
Reputation: 23176
There are no issues or special requirements for reading regexes from a file. The doubled backslashes are simply how Python displays the repr of a string that contains them, and printing a list shows the repr of each element. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it in the interpreter and you'll see it is displayed the same way:
>>> r"\?"
'\\?'
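To make the difference between a string's contents and its displayed form concrete, here is a minimal sketch. The variable name pattern is only for illustration; it holds the same two characters that a line read from a file would hold.

```python
pattern = r"\?"  # two characters: one backslash and one question mark

print(len(pattern))   # 2
print(pattern)        # \?
print(repr(pattern))  # '\\?'
print([pattern])      # ['\\?']  (a list displays the repr of its elements)
```

The backslash is only doubled in the repr; the string itself, and therefore the regex handed to re.sub, is unchanged.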
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1.
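Putting both points together, the following is a rough sketch of reading the pairs and applying them. It assumes the file is edited so that the replacement column uses \1 instead of $1 and does not escape the ? (in a replacement string, \? is left as a literal backslash followed by ?). An io.StringIO object stands in for open('regexes.csv'), and both the file contents and file_content are made up for illustration.

```python
import io
import re

# Stand-in for open('regexes.csv'); the contents are hypothetical.
sample_file = io.StringIO(
    r"<\?xml([^>]*?)>,<?XML\1>" "\n"
    r"peter,Peter" "\n"
)

detergent = []
for line in sample_file:
    line = line.rstrip("\n")
    if not line:
        continue  # skip blank lines
    # Split on the first comma only; this assumes the search pattern
    # itself contains no comma.
    search_term, replace_term = line.split(",", 1)
    detergent.append((search_term, replace_term))

file_content = '<?xml version="1.0"?> written by peter'
for search_term, replace_term in detergent:
    file_content = re.sub(search_term, replace_term, file_content)

print(file_content)  # <?XML version="1.0"?> written by Peter
```

No raw-string handling is needed for text read from a file: r'...' only affects how Python parses literals in source code, so the characters read from the file are already exactly what re.sub expects.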
Upvotes: 2