Reputation: 43
I have an HTML page that lists a long index of topics and page numbers. I want to find all the page numbers and their anchor tag links and decrement the page numbers by 1.
Here is an example line in the HTML:
<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>
I'm trying to find the number 28 in both places and decrement it by 1.
So far I've been able to find the number and replace it with itself, but I can't figure out how to decrement it. My code so far:
import fileinput
import re
for line in fileinput.input():
    line = re.sub(r'\>([0-9]+)\<', r'>\1<', line.rstrip())
    print(line)
Upvotes: 3
Views: 684
Reputation: 20025
You can use a replacement function while substituting:
import re
s = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'
re.sub(r'page(\d+)">\1', lambda m: 'page{0}">{0}'.format(int(m.group(1)) - 1), s)
Result:
<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page27">27</a></p>
With page(\d+)">\1 we match page followed by a number, followed by ">, followed by the same number again: \1 is a backreference to whatever the first pair of parentheses captured.
The substitution function takes a match object as its parameter. We take the first group of the match (m.group(1)), which is the number, convert it to an integer and decrement it. Then we build the replacement string from the decremented number.
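As a minimal sketch of how this could be applied line by line, the lambda can be pulled out into a named function and the pattern compiled once. The names PAGE_REF, decrement_page and decrement_line are made up for illustration and are not part of the answer above.

```python
import re

# Matches e.g. page28">28, where the link text repeats the anchor's number.
PAGE_REF = re.compile(r'page(\d+)">\1')

def decrement_page(match):
    """Rebuild the matched reference with both numbers lowered by one."""
    new_number = int(match.group(1)) - 1
    return 'page{0}">{0}'.format(new_number)

def decrement_line(line):
    """Decrement every page reference found in one line of HTML."""
    return PAGE_REF.sub(decrement_page, line)

sample = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'
print(decrement_line(sample))
```

decrement_line could then be called on each line inside the fileinput loop from the question. A reference whose link text differs from the number in its href is left untouched, because the backreference no longer matches.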
Upvotes: 3
Reputation: 122052
Note that you can pass a function as the repl argument to re.sub, which will be passed a single match object "for every non-overlapping occurrence of pattern":
def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)
Note that this expects match.group() to represent an integer; to match only the number, and not include the > and <, use lookarounds:
page_num = re.compile(r'''
(?<=>) # a > before the group
\d+ # followed by one or more digits
(?=<) # and a < after the group
''', re.VERBOSE)
This decrements the number in the link text:
>>> page_num.sub(decrement, line)
'<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">27</a></p>'
and can be applied similarly for '#page28"'.
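One way the same approach might be extended to the href is a second pattern with its own lookarounds. This is a sketch: the pattern name anchor_num and the assumption that every anchor has the form #pageN" are mine, not the answer's.

```python
import re

def decrement(match):
    """Decrement the number in the match."""
    return str(int(match.group()) - 1)

# Digits between > and <, i.e. the visible link text.
page_num = re.compile(r'(?<=>)\d+(?=<)')
# Digits between #page and the closing quote of the href (assumed format).
anchor_num = re.compile(r'(?<=#page)\d+(?=")')

line = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'
# Apply both substitutions in turn; each one reuses the same function.
result = anchor_num.sub(decrement, page_num.sub(decrement, line))
print(result)
```

Because the two patterns are independent, this version decrements the href and the link text even when they do not carry the same number.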
However, note that you should generally use an actual HTML parser, not regular expressions, for parsing HTML (which isn't a regular language).
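As a rough illustration of the parser route, the sample line happens to be well-formed XML, so the standard library's xml.etree.ElementTree can read it. This is a sketch under that assumption only: a full XHTML file would normally declare a namespace, in which case iter('a') would need the namespaced tag name, and arbitrary HTML would need a dedicated HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

line = '<p class="index">breakeven volume (BEV), <a href="ch02.xhtml#page28">28</a></p>'

# Assumes the fragment is well-formed XML with no namespace.
root = ET.fromstring(line)

for link in root.iter('a'):
    text = (link.text or '').strip()
    if not text.isdecimal():
        continue  # skip links whose text is not a plain page number
    link.text = str(int(text) - 1)
    # Decrement the trailing number of the #pageN anchor, if there is one.
    href = link.get('href', '')
    link.set('href', re.sub(r'(?<=#page)\d+$',
                            lambda m: str(int(m.group()) - 1), href))

result = ET.tostring(root, encoding='unicode')
print(result)
```

The regular expression is now applied only to a single attribute value, so it no longer has to cope with the surrounding markup.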
Upvotes: 1