tomsseisums

Reputation: 13367

Positive lookbehind with a matching group to be extracted

testString = ("<h2>Tricks</h2>"
              "<a href=\"#\"><i class=\"icon-envelope\"></i></a>")
import re
re.sub("(?<=[<h2>(.+?)</h2>\s+])<a href=\"#\"><i class=\"icon-(.+?)\"></i></a>", "{{ \\1 @ \\2 }}", testString)

This produces: invalid group reference.

Changing the replacement to use only \\1 extracts only envelope, which makes me think the lookbehind is ignored. Is there a way to extract something from a lookbehind?

I'd like to produce:

<h2>Tricks</h2>
{{ Tricks @ envelope }}

Upvotes: 0

Views: 92

Answers (1)

Martijn Pieters

Reputation: 1122372

Looks like you really want to use an HTML parser instead. Mixing regular expressions and HTML gets really painful, really fast.

In your regular expression, you created a character class (a set of characters that is allowed to match) consisting of <, h, 2, >, etc. here:

[<h2>(.+?)</h2>\s+]

which could have been written as:

[<>h2()+.?/\s]

and it would match the same characters.

Don't use [..] unless you want to create a set of characters for a match (\s, \d, etc. are pre-built character classes).
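As a quick check of that equivalence, here is a minimal sketch (the sample string and variable names are only for illustration) that tests each character of a heading against both classes:

```python
import re

# The class from the question, and the deduplicated equivalent.
original = re.compile(r"[<h2>(.+?)</h2>\s+]")
explicit = re.compile(r"[<>h2()+.?/\s]")

sample = "<h2>Tricks</h2>\n"

# A character class matches exactly one character, so test them one by one.
matches_original = [c for c in sample if original.fullmatch(c)]
matches_explicit = [c for c in sample if explicit.fullmatch(c)]

print(matches_original == matches_explicit)  # both accept the same characters
print(original.match(sample).group())        # only the first "<", not the whole heading
```

Neither class matches the letters of "Tricks", which shows that the brackets never matched the heading as a unit.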

However, even if you removed the brackets, the lookbehind would not be allowed: Python's re module does not accept variable-width patterns in a lookbehind (no + or *). So with the character class, the lookbehind does not match what you think it matches, and without it the lookbehind is not permissible.
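If you do stay with a regex, a rough sketch of both points follows. It assumes the markup is exactly as in the question; the variable names are made up, and \s* is used instead of \s+ because the test string has no whitespace between the two tags. The first part shows re rejecting the variable-width lookbehind. The second swaps the lookbehind for an ordinary capturing group and writes the heading back in the replacement:

```python
import re

test_string = ('<h2>Tricks</h2>'
               '<a href="#"><i class="icon-envelope"></i></a>')

# A lookbehind containing +? or * has no fixed width, so compiling it fails.
try:
    re.compile(r'(?<=<h2>(.+?)</h2>\s*)<a href="#">')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# Workaround: capture the heading normally and put it back via \1.
pattern = re.compile(
    r'(<h2>(.+?)</h2>)\s*<a href="#"><i class="icon-(.+?)"></i></a>'
)
result = pattern.sub(r'\1\n{{ \2 @ \3 }}', test_string)
print(result)
# <h2>Tricks</h2>
# {{ Tricks @ envelope }}
```

This only holds up while the HTML keeps this exact shape (attribute order, quoting, no nesting), which is why a parser is the safer choice.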

All in all, just use BeautifulSoup instead.
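To give an idea of the parser approach without any third-party dependency, here is a small sketch using the standard library's html.parser as a stand-in; BeautifulSoup would be more forgiving and shorter. The HeadingIconParser class and its attributes are hypothetical names invented for this example, and it assumes each icon belongs to the most recent h2 before it:

```python
from html.parser import HTMLParser


class HeadingIconParser(HTMLParser):
    """Hypothetical helper: pair each icon-* class with the preceding <h2> text."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.heading = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.heading = ""
        elif tag == "i" and self.heading is not None:
            # The class attribute may hold several space-separated names.
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("icon-"):
                    self.pairs.append((self.heading, cls[len("icon-"):]))

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.heading += data


test_string = ('<h2>Tricks</h2>'
               '<a href="#"><i class="icon-envelope"></i></a>')

parser = HeadingIconParser()
parser.feed(test_string)
parser.close()

lines = ["<h2>" + heading + "</h2>\n{{ " + heading + " @ " + icon + " }}"
         for heading, icon in parser.pairs]
print("\n".join(lines))
# <h2>Tricks</h2>
# {{ Tricks @ envelope }}
```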

Upvotes: 1
