Viszman

Reputation: 1398

I want to parse HTML in Python

I have this small class:

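from html.parser import HTMLParser  # Python 3; on Python 2 the module is HTMLParser

class HTMLTagStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initializes the parser state (calls reset())
        self.fed = []
    def handle_data(self, data):
        self.fed.append(data)
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            return attrs[0][1]
    def get_data(self):
        return ''.join(self.fed)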

parsing this HTML code:

<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>

This is the result I get: long text click here
but I want to get: long text click somelink.com

Is there a way to do this?

Upvotes: 0

Views: 582

Answers (4)

coder

Reputation: 16

I was actually checking out this new HTML parser library and came up with this solution:

from htmldom import htmldom
dom = htmldom.HtmlDom().createDom( """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""")
nodes = dom.find( "p" ).children( all_children = True ) # all_children = True puts the text nodes in the set as well.
for node in nodes:
    if node._is( "a" ):
        print( node.attr( "href" ).strip() )
    elif node._is( "text" ):
        print( node.getNode().text, end = '', sep = ' ' )

You can download the library from SourceForge or from the Python Package Index (HtmlDom). It works on Python 3.x. The documentation is not great, but it is understandable. Hope you like the answer :)

Upvotes: 0

Levon

Reputation: 143017

Take a look at BeautifulSoup. It will do that and much more.

Or you could use regular expressions or string operations to strip out the data you want. In the long run, using something like BeautifulSoup will pay off, especially if you expect to do more of this.

Here's one way to use BeautifulSoup to extract the single/only link in your HTML data (I'm not an expert with this, so there may be other, better ways - suggestions/corrections welcome).

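from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" package is Python 2 only

s = """<div id="footer">
       <p>long text.</p>
       <p>click <a href="somelink.com">here</a>
       </div>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(s, 'html.parser')
# href=True matches only <a> tags that actually have an href attribute
your_link = soup.find('a', href=True)['href']
print('long text click', your_link)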

will print:

long text click somelink.com

Upvotes: 8

bruno desthuilliers

Reputation: 77902

Replacing this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       return attrs[0][1]

with this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       # look up href by name instead of assuming it is the first attribute
       value = dict(attrs).get("href")
       if value:
           # add extra spaces since you don't sanitize
           # them in get_data
           self.fed.append(" %s " % value)

should kind of work. Or not, depending on the HTML source. That's why we have BeautifulSoup.
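
For reference, here is a minimal Python 3 sketch of the whole class with this kind of change applied. The class name LinkAwareStripper and the in_link flag are illustrative additions, not part of the original code. Skipping the anchor text ("here") and collapsing whitespace in get_data are assumptions about the output the question asks for.

```python
from html.parser import HTMLParser


class LinkAwareStripper(HTMLParser):
    """Collects text, substituting each link's href for its anchor text."""

    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # True while inside an <a> that has an href

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Assumption: the link text itself should be dropped.
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join("".join(self.fed).split())


html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

parser = LinkAwareStripper()
parser.feed(html)
parser.close()
print(parser.get_data())
```

On the snippet from the question this prints "long text. click somelink.com" (the period comes from the source HTML). It is still only a sketch: nested or malformed markup may need more careful handling.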

Upvotes: 0

bcelary

Reputation: 1837

This WILL NOT work for you:

import re

# html is assumed to hold the markup from the question
x = re.compile(r'<.*?>')
stripped = x.sub('', html)

because you also want to extract some attributes (like href) from the HTML tags.
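
To make that concrete, here is a small sketch that runs the same tag-stripping pattern on the HTML from the question. The variable names html and text are just for illustration. The href value disappears together with the tag, so only the link text is left.

```python
import re

html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

# Remove everything that looks like a tag, attributes included.
stripped = re.sub(r"<.*?>", "", html)

# Collapse the leftover newlines and spaces for readability.
text = " ".join(stripped.split())
print(text)
```

This prints "long text. click here", which is the unwanted result from the question.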

As Levon points out, you should go for BeautifulSoup.

Upvotes: 0
