Tim

Reputation: 99596

Some questions about Regex in Python

I would like to do some text conversion, such as reading in from a text file:

CONTENTS
1. INTRODUCTION
1.1 The Linear Programming Problem 2
1.2 Examples of Linear Problems 7

and writing to another text file:

("CONTENTS" "#") 
("1. INTRODUCTION" "#") 
("1.1 The Linear Programming Problem 2" "#11")  
("1.2 Examples of Linear Problems 7" "#16")

The current Python code I use for such conversion is:

import re

infile = open(infilename)
outfile = open(outfilename, "w")

pat = re.compile(r'^(.+?(\d+)) *$', re.M)
def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1),str(int(mat.group(2))+9))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa,line))
outfile.write(')')

  1. It will convert the original text to

    CONTENTS
    1. INTRODUCTION
    ("1.1 The Linear Programming Problem 2" "#11")
    ("1.2 Examples of Linear Problems 7" "#16")
    

    The last two lines are correct, but the first two are left unchanged. How can I handle the first two lines as well, either by modifying the current code or by using different code?

  2. The code was not written by me, but I would like to understand the usage of re.sub() here. As I found from a Python website,

    re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified.

    But in my code, its usage is `pat.sub(zaa,line)`, which seems to me not consistent with the quoted description. So I was wondering how to understand the usage in my code?

Thanks!

Upvotes: 1

Views: 259

Answers (3)

BudgieInWA

Reputation: 2265

With your regex you are searching for a line that ends with a number (and maybe trailing whitespace). You could make the number optional, ^(.+?(\d+)?) *$, and make sure the group 2 reference inside zaa can handle a missing match: mat.group(2) is None when the optional group does not take part in the match.

def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1), (str(int(mat.group(2))+9) if mat.group(2) else "") )

With this, you should get "#" when mat.group(2) is empty, and what you currently get when it is not.
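
As a quick check of the optional-group idea above, here is a minimal sketch run against three lines from the question. The names to_bookmark and PAGE_OFFSET are made up for this illustration; the offset of 9 is taken from the question's code.

```python
import re

# The trailing page number is optional; group 2 is None when it is absent.
pat = re.compile(r'^(.+?(\d+)?) *$', re.M)

PAGE_OFFSET = 9  # hypothetical name; the value comes from the question's code


def to_bookmark(mat):
    """Build one bookmark entry, leaving the target empty if no page was found."""
    page = mat.group(2)
    target = str(int(page) + PAGE_OFFSET) if page else ""
    return '("%s" "#%s")' % (mat.group(1), target)


for line in ["CONTENTS", "1. INTRODUCTION", "1.1 The Linear Programming Problem 2"]:
    print(pat.sub(to_bookmark, line))
# ("CONTENTS" "#")
# ("1. INTRODUCTION" "#")
# ("1.1 The Linear Programming Problem 2" "#11")
```

The truthiness test on page covers both None and an empty string, so the same function should also work if the pattern is later changed to use \d* instead.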

Upvotes: 3

ridgerunner

Reputation: 34435

This tested script generates the desired output:

import re
infilename = "infile.txt"
outfilename = "outfile.txt"

infile = open(infilename)
outfile = open(outfilename, "w")

pat = re.compile(r'^(.+?(\d*)) *$', re.M)
def zaa(mat):
    if mat.group(2):
        return '("%s" "#%s")' % (mat.group(1),str(int(mat.group(2))+9))
    else:
        return '("%s" "#")' % (mat.group(1))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa,line))
outfile.write(')')

infile.close()
outfile.close()
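
To try the same logic without creating files, the sketch below feeds the sample from the question through io.StringIO objects. The convert helper and the in-memory streams are assumptions added for this illustration and are not part of the script above; the pattern and zaa are the same.

```python
import io
import re

pat = re.compile(r'^(.+?(\d*)) *$', re.M)


def zaa(mat):
    # Group 2 is an empty string when the line has no trailing number.
    if mat.group(2):
        return '("%s" "#%s")' % (mat.group(1), str(int(mat.group(2)) + 9))
    return '("%s" "#")' % mat.group(1)


def convert(infile, outfile):
    """Hypothetical helper: wrap every converted line in a (bookmarks ...) block."""
    outfile.write('(bookmarks \n')
    for line in infile:
        outfile.write(pat.sub(zaa, line))
    outfile.write(')')


source = io.StringIO(
    "CONTENTS\n"
    "1. INTRODUCTION\n"
    "1.1 The Linear Programming Problem 2\n"
    "1.2 Examples of Linear Problems 7\n"
)
result = io.StringIO()
convert(source, result)
print(result.getvalue())
```

Because convert only needs objects that can be iterated over and written to, it should accept real file objects opened with open() just as well.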

Upvotes: 2

senderle

Reputation: 151177

"But in my code, its usage is pat.sub(zaa,line), which seems to me not consistent with the quoted description."

The difference is in the sub call: the documentation you quote is for the re.sub function, but what is being used here is the sub method of a compiled regular expression object. The initial pattern argument in re.sub() is replaced with the regular expression object to which the sub method is bound. In other words,

pat.sub(zaa, line)

is equivalent to

re.sub(pat, zaa, line)

Terrible variable names by the way.
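
The equivalence described above can be checked with a small sketch. It also shows that the replacement argument may be a function that receives each match object, which is how zaa is used in the question. The names page_pattern and shift_page, and the simplified pattern, are made up for this illustration.

```python
import re

# Simplified pattern for the demo: a number at the end of the line.
page_pattern = re.compile(r'(\d+) *$')


def shift_page(match):
    # Called once per match; the returned string replaces the matched text.
    return str(int(match.group(1)) + 9)


line = "1.2 Examples of Linear Problems 7"

via_method = page_pattern.sub(shift_page, line)          # method on the compiled pattern
via_function = re.sub(page_pattern, shift_page, line)    # module function, compiled pattern
via_string = re.sub(r'(\d+) *$', shift_page, line)       # module function, pattern string

print(via_method)  # 1.2 Examples of Linear Problems 16
print(via_method == via_function == via_string)  # True
```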

Upvotes: 1
