Reputation: 13367
testString = ("<h2>Tricks</h2>"
"<a href=\"#\"><i class=\"icon-envelope\"></i></a>")
import re
re.sub("(?<=[<h2>(.+?)</h2>\s+])<a href=\"#\"><i class=\"icon-(.+?)\"></i></a>", "{{ \\1 @ \\2 }}", testString)
This produces: invalid group reference. Making the replacement use only \\1 extracts only envelope, which makes me think that the lookbehind is ignored. Is there a way to extract something from a lookbehind?
I'd like to produce:
<h2>Tricks</h2>
{{ Tricks @ envelope }}
Upvotes: 0
Views: 92
Reputation: 1122372
Looks like you really want to use an HTML parser instead. Mixing regular expressions and HTML gets really painful, really fast.
In your regular expression, you created a character class (a set of characters that is allowed to match) consisting of <, h, 2, >, etc. here:
[<h2>(.+?)</h2>\s+]
which could have been written as:
[<>h2()+.?/\s]
and it would match the same characters.
Don't use [..] unless you want to create a set of characters for a match (\s, \d, etc. are pre-built character classes).
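As a quick check of the point above, here is a minimal sketch comparing the two character classes. The variable names are only illustrative; the two patterns are the ones quoted above.

```python
import re
import string

original = re.compile(r"[<h2>(.+?)</h2>\s+]")
rewritten = re.compile(r"[<>h2()+.?/\s]")

# Inside [...] the parentheses are literal characters, so the pattern
# defines no capture group at all.
print(original.groups)

# Both classes accept exactly the same single characters.
same = all(
    bool(original.fullmatch(ch)) == bool(rewritten.fullmatch(ch))
    for ch in string.printable
)
print(same)
```

Because the class contains no real group, the full pattern in the question ends up with a single group (the icon name), which fits the observation that \\1 gives envelope.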
However, even if you were to remove the brackets, the lookbehind is not allowed. Python's re module does not accept variable-width patterns in a lookbehind (no + or *). So, with the character class, the lookbehind no longer matches what you think it matches, and its parentheses no longer form a group, which is why \\2 is an invalid group reference; without the character class, the lookbehind is not permissible.
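A small sketch of that restriction, assuming the same test string as in the question (written here on one line). The second half is not the approach recommended in this answer, just one possible regex-only workaround: capture the heading with an ordinary group and write it back in the replacement.

```python
import re

test_string = '<h2>Tricks</h2><a href="#"><i class="icon-envelope"></i></a>'

# A lookbehind containing + or * has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=<h2>(.+?)</h2>\s+)<a")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# Workaround sketch: match the heading as a normal group and re-emit it.
pattern = re.compile(
    r'(<h2>(.+?)</h2>\s*)<a href="#"><i class="icon-(.+?)"></i></a>'
)
result = pattern.sub(r"\1{{ \2 @ \3 }}", test_string)
print(result)
```

Note that the test string has no whitespace between the heading and the link, so the result stays on one line. This still breaks as soon as the markup varies (extra attributes, nested tags), which is the usual argument for a parser.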
All in all, just use BeautifulSoup instead.
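To illustrate the parser-based approach without a third-party dependency, here is a rough sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup. The class name HeadingIconParser and its attributes are made up for this example, and it assumes each icon link follows the h2 it belongs to.

```python
from html.parser import HTMLParser


class HeadingIconParser(HTMLParser):
    """Collect (heading text, icon name) pairs from <h2> and <i class="icon-..."> tags."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.heading = ""
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # Start collecting the text of a new heading.
            self.in_h2 = True
            self.heading = ""
        elif tag == "i":
            # Pair any icon-* class with the most recent heading.
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("icon-"):
                    self.pairs.append((self.heading, cls[len("icon-"):]))

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.heading += data


test_string = '<h2>Tricks</h2><a href="#"><i class="icon-envelope"></i></a>'
parser = HeadingIconParser()
parser.feed(test_string)
parser.close()
lines = ["{{ %s @ %s }}" % pair for pair in parser.pairs]
print(lines)
```

BeautifulSoup would let you express the same lookup more compactly and also rewrite the tree in place; the sketch above only extracts the values.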
Upvotes: 1