Reputation: 99596
I would like to do some text conversion, such as reading in from a text file:
CONTENTS
1. INTRODUCTION
1.1 The Linear Programming Problem 2
1.2 Examples of Linear Problems 7
and writing to another text file:
("CONTENTS" "#")
("1. INTRODUCTION" "#")
("1.1 The Linear Programming Problem 2" "#11")
("1.2 Examples of Linear Problems 7" "#16")
The current Python code I use for such conversion is:
infile = open(infilename)
outfile = open(outfilename, "w")
pat = re.compile(r'^(.+?(\d+)) *$',re.M)
def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1),str(int(mat.group(2))+9))
outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa,line))
outfile.write(')')
It will convert the original text to
CONTENTS
1. INTRODUCTION
("1.1 The Linear Programming Problem 2" "#11")
("1.2 Examples of Linear Problems 7" "#16")
The last two lines are correct, but the first two are not. How can I make the conversion handle the first two lines as well, either by modifying the current code or by using different code?
The code was not written by me, but I would like to understand the usage of re.sub() here. As I found from a Python website:
re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified.
But in my code, its usage is pat.sub(zaa,line), which seems to me not consistent with the quoted description. So how should I understand the usage in my code?
Thanks!
Upvotes: 1
Views: 259
Reputation: 2265
With your regex you are searching for a line that ends with a number (and maybe trailing whitespace). You could make the number optional: ^(.+?(\d+)?) *$ and make sure your group 2 reference inside zaa can handle a missing number: when the optional group does not match, mat.group(2) is None rather than a string.
def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1), (str(int(mat.group(2))+9) if mat.group(2) else "") )
With this, you should get "#" when the line has no trailing number (mat.group(2) is None), and what you currently get when it does.
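As a quick check of the optional-group idea, here is a minimal sketch that runs the pattern over the sample lines from the question in memory rather than through files. The names PAGE_PATTERN, PAGE_OFFSET, to_bookmark, and sample are made up for this illustration; the offset of 9 is taken from the question's code.

```python
import re

# Same pattern as in the question, with the trailing number made optional.
PAGE_PATTERN = re.compile(r'^(.+?(\d+)?) *$', re.M)
PAGE_OFFSET = 9  # offset used in the question's code

def to_bookmark(match):
    """Build one bookmark line; group(2) is None when the line has no trailing number."""
    title, page = match.group(1), match.group(2)
    target = str(int(page) + PAGE_OFFSET) if page else ""
    return '("%s" "#%s")' % (title, target)

# Hypothetical in-memory stand-in for the input file.
sample = (
    "CONTENTS\n"
    "1. INTRODUCTION\n"
    "1.1 The Linear Programming Problem 2\n"
    "1.2 Examples of Linear Problems 7\n"
)

converted = PAGE_PATTERN.sub(to_bookmark, sample)
print(converted)
```

Because of re.M, the whole text can be passed to sub() in one call; looping over the file line by line, as the original code does, should give the same bookmark lines.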
Upvotes: 3
Reputation: 34435
This tested script generates the desired output:
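import re

infilename = "infile.txt"
outfilename = "outfile.txt"
infile = open(infilename)
outfile = open(outfilename, "w")
# \d* also matches zero digits, so lines without a page number still match.
pat = re.compile(r'^(.+?(\d*)) *$', re.M)

def zaa(mat):
    if mat.group(2):
        # Page number present: shift it by 9 to get the bookmark target.
        return '("%s" "#%s")' % (mat.group(1), str(int(mat.group(2)) + 9))
    else:
        # No page number: emit an empty target.
        return '("%s" "#")' % (mat.group(1))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa, line))
outfile.write(')')
infile.close()
outfile.close()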
Upvotes: 2
Reputation: 151177
But in my code, its usage is pat.sub(zaa,line), which seems to me not consistent with the quoted description.
The difference is in the sub call; the documentation you quote is for the re.sub function, but what is being used here is the sub method of a compiled regular expression object. The initial pattern argument in re.sub() is replaced with the regular expression object to which the sub method is bound. So in other words,
pat.sub(zaa, line)
is equivalent to
re.sub(pat, zaa, line)
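A small sketch of that equivalence, using a simplified pattern and made-up names (add_nine, via_method, via_function, via_string) rather than the question's code. It also shows that the replacement argument may be a function: it is called once per match with the match object, and its return value is used as the replacement text.

```python
import re

pat = re.compile(r'(\d+)')

def add_nine(mat):
    # Called once per match; the returned string replaces the matched text.
    return str(int(mat.group(1)) + 9)

line = "Examples of Linear Problems 7"

via_method = pat.sub(add_nine, line)           # sub method of the compiled pattern
via_function = re.sub(pat, add_nine, line)     # module-level function, compiled pattern first
via_string = re.sub(r'(\d+)', add_nine, line)  # module-level function, pattern given as a string

print(via_method)
print(via_method == via_function == via_string)
```

All three calls should produce the same string; the only difference is where the pattern comes from.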
Terrible variable names by the way.
Upvotes: 1