John Gayle

Reputation: 43

Find numbers in a string and decrement them

I have an HTML page that lists a long index of topics and page numbers. I want to find all the page numbers and their anchor tag links and decrement the page numbers by 1.

Here is an example line in the HTML:

<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>

I'm trying to find the number 28 in both places and decrement by 1.

So far I've been able to find the number and replace it with itself, but I can't figure out how to decrement it. My code so far:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(r'\>([0-9]+)\<', r'>\1<', line.rstrip())
    print(line)

Upvotes: 3

Views: 684

Answers (2)

JuniorCompressor

Reputation: 20025

You can use a replacement function while substituting:

import re
s = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'
re.sub(r'page(\d+)">\1', lambda m: 'page{0}">{0}'.format(int(m.group(1)) - 1), s)

Result:

<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page27">27</a></p>

The pattern page(\d+)">\1 matches the literal text page, then a number (captured as group 1), then ">, then the same number again: \1 is a backreference to whatever group 1 captured.

The replacement function receives a match object as its argument. We take the first group of the match (m.group(1)), which is the number as a string, convert it to an integer, and decrement it. We then rebuild the matched text with the decremented number in both places.
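To apply this to a whole file, the substitution can be wrapped in a small helper and called once per line. The following is a minimal sketch, not a definitive implementation: the names PAGE_REF and decrement_page_refs are made up for illustration, and the trailing (?=<) lookahead is an addition to the pattern above, assumed here so that a link text such as 280 is not mistaken for 28 followed by 0.

```python
import re

# page<N>">N, where N must be the complete link text (next character is <).
# The (?=<) lookahead is an assumption added in this sketch.
PAGE_REF = re.compile(r'page(\d+)">\1(?=<)')


def decrement_page_refs(line):
    """Return line with each matching page number lowered by 1 in both places."""
    def repl(m):
        new_page = int(m.group(1)) - 1
        return 'page{0}">{0}'.format(new_page)
    return PAGE_REF.sub(repl, line)


sample = ('<p class="index">breakeven volume (BEV), '
          '<a href="ch02.xhtml#page28">28</a>, '
          '<a href="ch02.xhtml#page100">100</a></p>')
print(decrement_page_refs(sample))
```

In the loop from the question, the re.sub call would then become line = decrement_page_refs(line.rstrip()).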

Upvotes: 3

jonrsharpe

Reputation: 122052

Note that you can pass a function as the repl argument to re.sub; it is called with a single match object "for every non-overlapping occurrence of pattern", and its return value is used as the replacement:

def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)

This expects match.group() to be a string of digits only. To match just the number, without including the > and <, use lookarounds:

import re

page_num = re.compile(r'''
    (?<=>) # preceded by a > (not part of the match)
    \d+    # one or more digits: the only text actually matched
    (?=<)  # followed by a < (not part of the match)
''', re.VERBOSE)

This works as you require:

>>> page_num.sub(decrement, line)
'<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">27</a></p>'

and can be applied similarly for '#page28"'.
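As a rough sketch of that second step, a similar lookaround pattern can target the number in the href. The names href_num and link_text_num are hypothetical, and the sketch assumes every page anchor has the form #page followed by digits and a closing double quote.

```python
import re


def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)


# Digits preceded by '#page' and followed by a double quote (the href).
href_num = re.compile(r'(?<=#page)\d+(?=")')
# Digits between > and < (the link text), as in the pattern above.
link_text_num = re.compile(r'(?<=>)\d+(?=<)')

line = ('<p class="index">breakeven volume (BEV), '
        '<a href="ch02.xhtml#page28">28</a></p>')
result = link_text_num.sub(decrement, href_num.sub(decrement, line))
print(result)
```

The two passes are independent of each other, so nothing here checks that the href and the link text held the same number to begin with.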

However, note that you should generally use an actual HTML parser, not regular expressions, for parsing HTML (which isn't a regular language).
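For illustration, here is one way a parser-based version might look, using the standard library's xml.etree.ElementTree. It is only a sketch under strong assumptions: the input must be well-formed XML (as the example line is), and the elements must carry no XML namespace; a full XHTML file usually declares one, in which case the tag would be '{http://www.w3.org/1999/xhtml}a' instead of 'a'. The function name decrement_links is hypothetical.

```python
import xml.etree.ElementTree as ET


def decrement_links(xhtml_fragment):
    """Decrement the page number in each <a> whose text matches its #pageN anchor."""
    root = ET.fromstring(xhtml_fragment)
    for a in root.iter('a'):
        href = a.get('href', '')
        text = (a.text or '').strip()
        # Only touch links where the text is a number and the href ends in #page<number>.
        if text.isdecimal() and href.endswith('#page' + text):
            new_page = str(int(text) - 1)
            a.set('href', href[:-len(text)] + new_page)
            a.text = new_page
    return ET.tostring(root, encoding='unicode')


sample = ('<p class="index">breakeven volume (BEV), '
          '<a href="ch02.xhtml#page28">28</a></p>')
print(decrement_links(sample))
```

Unlike the regex versions, this changes a link only when the href and the link text agree, and it ignores numbers that merely happen to sit between > and < elsewhere in the markup.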

Upvotes: 1
