Reputation: 61
I need to extract the names from content like this:
<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a name="bla" title="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>
but with this code:
names = []
matches = re.findall(r'''<a\stitle="(?P<title>[^">]+)"\sname="(?P<name>[^">]+)"></a>''', content, re.VERBOSE)
for (title, name) in matches:
    if title == name:
        names.append(title)
return names
...I get names = []. What is wrong? Thanks.
Upvotes: 1
Views: 220
Reputation: 336108
Uh, well, obviously, in your sample text, name comes before title, and in your regex, title is expected before name. This is precisely the reason (or one of them) why you should be using an HTML parser instead. Try BeautifulSoup, for example.
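To illustrate the parser route, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup, only so that it runs without installing anything. The class name AnchorNameCollector is made up for this example, and the sample content is the question's markup with the attributes of the second anchor deliberately swapped to show that order no longer matters.

```python
from html.parser import HTMLParser


class AnchorNameCollector(HTMLParser):
    """Collect the name of every <a> whose name and title attributes match."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (attribute, value) pairs; turning it into a dict
        # makes the lookup independent of the order the attributes appear in.
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None and name == attributes.get("title"):
            self.names.append(name)


# The question's sample, with the second anchor's attributes swapped on purpose.
content = """
<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a title="bla" name="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>
"""

parser = AnchorNameCollector()
parser.feed(content)
parser.close()
names = parser.names
print(names)  # ['blu', 'bla', 'blo']
```

Because the attributes are looked up by key, extra attributes on the tag or a different attribute order should not affect the result.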
If you insist on regex, just swap the two attributes in the pattern (and make sure that you'll never get those attributes in a different order, and never any attributes other than those):
names = []
matches = re.findall(r'''<a\sname="(?P<name>[^">]+)"\stitle="(?P<title>[^">]+)"></a>''', content, re.VERBOSE)
for (name, title) in matches:
    if title == name:
        names.append(title)
Result:
>>> names
['blu', 'bla', 'blo']
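To make the caveat about attribute order concrete, here is a small self-contained sketch. The helper name matching_names and the two one-anchor samples are made up for illustration, and re.VERBOSE is dropped because the pattern contains no literal whitespace or comments. The same name-first pattern finds the anchor in one sample and silently finds nothing in the other.

```python
import re

# Same pattern as the corrected version: name first, then title.
ANCHOR = re.compile(r'<a\sname="(?P<name>[^">]+)"\stitle="(?P<title>[^">]+)"></a>')


def matching_names(content):
    """Return the name of each anchor whose name and title are equal."""
    # findall returns one (name, title) tuple per match, in group order.
    return [name for name, title in ANCHOR.findall(content) if name == title]


name_first = '<p>\n<a name="blu" title="blu"></a>orense\n</p>'
title_first = '<p>\n<a title="blu" name="blu"></a>orense\n</p>'

print(matching_names(name_first))   # ['blu']
print(matching_names(title_first))  # []
```

The second call is exactly the situation in the question: the markup is valid, but the pattern expects the attributes in the other order, so there is no match and no error either.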
Upvotes: 4