Reputation: 18387
Consider the following:
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>
How would you go about extracting the sitemap line with a regex in Python?
<a href="/sitemap">Sitemap</a>
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags, and there are also multiple hotlink divs, so neither one singles out the sitemap link on its own?
Upvotes: 6
Views: 21214
Reputation: 44709
To extract the contents of the tag:
<a href="/sitemap">Sitemap</a>
... I would use:
>>> import re
>>> s = '''
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>'''
>>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
>>> m.group(1)
'Sitemap'
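If the target URL varies, the same idea can be wrapped in a small function. This is only a sketch: `link_text` is a hypothetical helper name, and it assumes the anchor is written exactly as `<a href="...">` with double quotes and no other attributes, as in the sample above.

```python
import re

def link_text(html, href):
    """Return the text of the first <a href="..."> whose href equals `href`, or None."""
    # re.escape stops characters such as "." or "?" in the URL from being
    # interpreted as regex syntax.
    pattern = re.compile(r'<a href="' + re.escape(href) + r'">(.*?)</a>')
    m = pattern.search(html)
    return m.group(1) if m else None

s = '''
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>'''

print(link_text(s, "/sitemap"))  # Sitemap
print(link_text(s, "/missing"))  # None
```

Checking for `None` before calling `group` avoids an AttributeError when the link is absent.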
Upvotes: 5
Reputation: 739
Use BeautifulSoup or lxml if you need to parse HTML.
Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?
If you really have to use regular expressions, have a look at re.findall.
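For illustration, here is roughly how `re.findall` could cover each of those readings. It is a sketch that assumes every anchor has a double-quoted `href` as its first attribute, which holds for the sample in the question but not for HTML in general.

```python
import re

html = '''
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>'''

# With two capture groups, findall returns a list of (href, text) tuples.
links = re.findall(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', html)

print(links[-1])  # the last link
print(links[2])   # the third link
# The link that points to /sitemap, chosen by target rather than position.
print([text for href, text in links if href == "/sitemap"])
```

Selecting by `href` is the least fragile of the three, since it keeps working if links are added or reordered.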
Upvotes: 1
Reputation: 33593
Parsing HTML with regular expressions is a bad idea!
Consider the following piece of HTML:
<a></a > <!-- legal html, but won't pass your regex -->
<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>
There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
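To see the failure concretely, the pattern from the question can be run against those two lines. The PHP-style `/.../i` delimiters are translated here to `re.IGNORECASE`, which is my reading of what was intended.

```python
import re

# The question's pattern, written as a Python regex.
pattern = re.compile(r'<a(.*?)a>', re.IGNORECASE)

spaced = '<a></a >'
commented = '<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>'

# The space before ">" means there is no literal "a>" to end the match.
print(pattern.search(spaced))

# The match ends at the first "a>", which sits inside the comment.
print(pattern.search(commented).group(0))
```

The first search finds nothing, and the second returns a fragment cut off in the middle of the comment instead of the whole anchor.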
You should consider using Beautiful Soup, an HTML parser for Python.
Anyhow, an ad hoc solution using a regex is
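If installing a third-party package is not an option, the standard library's `html.parser` copes with both of the awkward cases above. Note that this swaps in `html.parser` in place of Beautiful Soup, and `LinkCollector` is a hypothetical helper written for this sketch, not part of any library.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect an (href, text) pair for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False   # True while inside an open <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")  # None if there is no href
            self._text = []

    def handle_data(self, data):
        # Comments go to handle_comment, so they never end up in the text.
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text)))
            self._in_a = False

parser = LinkCollector()
parser.feed('<a></a >\n<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>')
parser.close()
print(parser.links)
```

Both anchors are reported, the first with no href and empty text, the second as `/sitemap` with the text `Sitemap` and the comment left out.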
import re
data = """
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>
"""
# Non-greedy ".*?" so that two anchors on the same line are not merged into one match.
e = re.compile(r'<a *[^>]*>.*?</a *>')
print(e.findall(data))
Output:
>>> e.findall(data)
['<a href="foo1.com">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']
Upvotes: 6
Reputation: 46773
Don't use a regex. Use BeautifulSoup, an HTML parser.
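# Beautiful Soup 4 import; "from BeautifulSoup import BeautifulSoup" was version 3.
from bs4 import BeautifulSoup
html = """
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>"""
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
# Third <div id="hotlink">, then the first <a> inside it.
soup.find_all("div", id="hotlink")[2].a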
# <a href="/sitemap">Sitemap</a>
Upvotes: 13