Antonio

Reputation: 61

re.findall and regex

I need to get the names from content like this:

<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a name="bla" title="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>

but with this code:

names = []
matches = re.findall(r'''<a\stitle="(?P<title>[^">]+)"\sname="(?P<name>[^">]+)"></a>''',content, re.VERBOSE)
for (title, name) in matches:
    if title == name:
        names.append(title)
return names

...I get names = []. What is wrong? Thanks.

Upvotes: 1

Views: 220

Answers (1)

Tim Pietzcker

Reputation: 336108

Well, in your sample text, name comes before title, but your regex expects title before name, so it never matches. This is precisely the reason (or one of them) why you should be using an HTML parser instead. Try BeautifulSoup, for example.
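To illustrate why a parser avoids the ordering problem, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup, purely to stay dependency-free; that swap is my assumption, not part of the suggestion above. The class name AnchorNameCollector is hypothetical, and the sample content deliberately flips the attribute order on the second tag.

```python
from html.parser import HTMLParser


class AnchorNameCollector(HTMLParser):
    """Collect the name of every <a> tag whose name and title are equal.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (attribute, value) pairs; as a dict,
        # the order in which they appear in the HTML no longer matters.
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None and name == attributes.get("title"):
            self.names.append(name)


# Sample from the question, with the second tag's attributes swapped.
content = '''<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a title="bla" name="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>'''

collector = AnchorNameCollector()
collector.feed(content)
print(collector.names)  # ['blu', 'bla', 'blo']
```

Because the parser hands over the attributes as pairs, the same code keeps working if the tags gain extra attributes such as href or class.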

If you insist on regex, just swap the two attributes in the pattern (and make sure that you'll never get those attributes in a different order, and never any attributes other than those):

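import re

names = []
# name comes first here, matching the attribute order in the sample HTML
matches = re.findall(r'''<a\sname="(?P<name>[^">]+)"\stitle="(?P<title>[^">]+)"></a>''', content, re.VERBOSE)
# findall returns one (name, title) tuple per match, in group order
for (name, title) in matches:
    if title == name:
        names.append(title)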

Result:

>>> names
['blu', 'bla', 'blo']
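If the attribute order cannot be guaranteed and a regex is still preferred, one possible workaround is to match in two steps: first the whole opening tag, then the individual attributes. This is only a sketch under assumptions: the function name matching_anchor_names is hypothetical, attribute values are assumed to be double-quoted, and a ">" inside an attribute value would break it.

```python
import re


def matching_anchor_names(content):
    """Return the name of each <a> tag whose name and title are equal.

    Hypothetical helper; assumes double-quoted attribute values.
    """
    names = []
    # Step 1: capture the attribute text of every opening <a ...> tag.
    for attr_text in re.findall(r'<a\s+([^>]*)>', content):
        # Step 2: collect attribute="value" pairs into a dict,
        # so their order and any extra attributes do not matter.
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', attr_text))
        name = attrs.get("name")
        if name is not None and name == attrs.get("title"):
            names.append(name)
    return names


sample = ('<p><a name="blu" title="blu"></a>orense</p>'
          '<p><a href="#" title="bla" name="bla"></a>toledo</p>'
          '<p><a name="x" title="y"></a>other</p>')
print(matching_anchor_names(sample))  # ['blu', 'bla']
```

This tolerates reordered and additional attributes, but it is still pattern matching on markup, so a real HTML parser remains the more robust choice.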

Upvotes: 4
