Reputation: 1398
I have this small class:
class HTMLTagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, data):
        self.fed.append(data)
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            return attrs[0][1]
    def get_data(self):
        return ''.join(self.fed)
parsing this HTML code:
<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>
This is the result I get: long text click here
but I want to get: long text click somelink.com
Is there a way to do this?
Upvotes: 0
Views: 582
Reputation: 16
I was actually checking out this new HTML parser library and came up with this solution:
from htmldom import htmldom
dom = htmldom.HtmlDom().createDom("""<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""")
# all_children=True includes the text nodes in the result set
nodes = dom.find("p").children(all_children=True)
for node in nodes:
    if node._is("a"):
        print(node.attr("href").strip())
    elif node._is("text"):
        print(node.getNode().text, end='', sep=' ')
You can download the library from SourceForge or from the Python Package Index (HtmlDom). It works on Python 3.x. The documentation is not that good, but it is understandable. Hope you like the answer :)
Upvotes: 0
Reputation: 143017
Take a look at BeautifulSoup; it will do that and much more.
Or you could use regular expressions/string operations to strip out the data you want. In the long run using something like BeautifulSoup will pay off, especially if you expect to do more of this.
Here's one way to use BeautifulSoup to extract the single/only link in your HTML data (I'm not an expert with this, so there may be other, better ways - suggestions/corrections welcome).
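# BeautifulSoup 4 (package bs4); BeautifulSoup 3 used `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup
s = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""
soup = BeautifulSoup(s, "html.parser")
# find the first <a> tag that has an href attribute and read that attribute
your_link = soup.find('a', href=True)['href']
print('long text click', your_link)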
will print:
long text click somelink.com
Upvotes: 8
Reputation: 77902
Replacing this:
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            return attrs[0][1]
With this:
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            value = dict(attrs).get("href", None)
            if value:
                # add extra spaces since you don't sanitize
                # them in get_data
                self.fed.append(" %s " % value)
should kind of work. Or not, depending on the HTML source code. The return value of handle_starttag is ignored by the parser, which is why the original version had no effect. That's why we have BeautifulSoup.
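For reference, here is a minimal self-contained sketch of that idea on Python 3 (where HTMLParser lives in html.parser). The class name LinkAwareStripper and the in_link flag are my own additions, not from the question. The flag suppresses the link text ("here") so that only the href is emitted, which is an assumption about what the asker wants. Whitespace is collapsed in get_data.

```python
from html.parser import HTMLParser


class LinkAwareStripper(HTMLParser):
    """Strip tags, but emit each link's href in place of its link text."""

    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # True while inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Skip the visible link text; the href was already emitted.
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Collapse runs of whitespace left over from the markup.
        return " ".join("".join(self.fed).split())


html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

stripper = LinkAwareStripper()
stripper.feed(html)
stripper.close()
result = stripper.get_data()
print(result)  # long text. click somelink.com
```

Like the snippet above, this only kind of works: an <a> without an href keeps its text, and nested or malformed markup may confuse the flag.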
Upvotes: 0
Reputation: 1837
This WILL NOT work for you:
import re
x = re.compile(r'<.*?>')
stripped = x.sub('', html)
because you also want to extract some attributes (such as href) from the HTML tags, and this pattern throws the whole tag away.
As Levon points out: you should go for BeautifulSoup.
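A quick sketch to illustrate the point, using the HTML from the question (the variable names are mine): the pattern removes each tag together with its attributes, so the href never reaches the output.

```python
import re

html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

# Remove everything between '<' and the next '>' (non-greedy).
tag_pattern = re.compile(r"<.*?>")
stripped = tag_pattern.sub("", html)

# Collapse leftover whitespace for easier comparison.
collapsed = " ".join(stripped.split())
print(collapsed)  # long text. click here
```

The link target is gone, and only the link text "here" survives.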
Upvotes: 0