html parser python

Question

I am trying to parse a website using the HTMLParser module. I want to extract the first link after the comment <!-- /topOfPage -->, but I don't know how to do it. The documentation mentions a method called handle_comment, but I haven't worked out how to use it correctly. This is what I have so far:

import HTMLParser

class LinkFinder(HTMLParser.HTMLParser):
    def __init__(self, *args, **kwargs):
        # Can't use super() - HTMLParser is an old-style class
        HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        if data == "topOfPage":
            print data

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any("href" == t[0] for t in attrs): # found link
            self.in_linktag = True
            self.url_cache.append([dict(attrs)['href']])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag: # ignore a closing tag with no open link
            self.in_linktag = False

    def handle_data(self, data):
        if self.in_linktag:
            self.url_cache[-1].append(data)

TESTDATA = """
<html>
<body>
<div>
 <ul>
    <!-- /topOfPage -->
<tr>
    <td class="empty-cell-left">
    <td class="image">

    <a href="http://test" rel="nofollow">
 <ul>
</div>
</body>
</html>
"""

def main():
    lf = LinkFinder()
    lf.feed(TESTDATA)
    lf.close()
    print lf.url_cache

if __name__ == "__main__":
    main()

How to do it?

Giulio Piancastelli · Accepted Answer

You need an additional flag that records whether the parser has already passed the comment, so that you can save the href of the first link after it.

def __init__(self, *args, **kwargs):
    # ...
    self.first_link_after_comment = False

Then, when the parser encounters the comment, set the flag.

def handle_comment(self, data):
    if data.strip() == '/topOfPage':
        self.first_link_after_comment = True
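
The strip() call matters because the parser hands handle_comment the text exactly as it appears between <!-- and -->, surrounding spaces included. Here is a minimal sketch of that behaviour, assuming Python 3 (where the module is html.parser rather than HTMLParser); the CommentCollector class is a made-up helper for illustration only.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records the raw text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->", not stripped
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<ul><!-- /topOfPage --></ul>")
collector.close()

raw = collector.comments[0]
print(repr(raw))                    # ' /topOfPage '
print(raw == "/topOfPage")          # False
print(raw.strip() == "/topOfPage")  # True
```

This is also why the comparison against "topOfPage" in the question never matches: the comment text contains a leading slash as well as the spaces.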

When you handle an opening tag, return early if the parser has not yet passed the comment.

def handle_starttag(self, tag, attrs):
    if not self.first_link_after_comment:
        return
    # ...

Conversely, when you handle the closing tag, you want to acknowledge that the mission has been accomplished.

def handle_endtag(self, tag):
    if tag == 'a' and self.in_linktag: # ignore a closing tag with no open link
        self.in_linktag = False
        self.first_link_after_comment = False

Finally, when you append data, make sure it is not an empty or whitespace-only string.

def handle_data(self, data):
    if self.in_linktag and data.strip():
        self.url_cache[-1].append(data)
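
To see why the data.strip() check is worth having, the sketch below records every chunk that handle_data receives inside a link next to the chunks that survive the check. It assumes Python 3 (html.parser); LinkTextCollector, raw_chunks and kept_chunks are hypothetical names, not part of the code above.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: gathers text chunks seen inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_linktag = False
        self.raw_chunks = []   # everything handle_data receives inside a link
        self.kept_chunks = []  # only the chunks that contain visible text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_linktag = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_linktag = False

    def handle_data(self, data):
        if not self.in_linktag:
            return
        self.raw_chunks.append(data)
        if data.strip():  # skip empty or whitespace-only text
            self.kept_chunks.append(data)


collector = LinkTextCollector()
collector.feed('<a href="http://test">\n   </a><a href="http://other">label</a>')
collector.close()

print(collector.raw_chunks)   # ['\n   ', 'label']
print(collector.kept_chunks)  # ['label']
```

Without the check, the line break and indentation inside the first link would end up in url_cache as if they were link text.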


And here you are.

$ your_script.py
[['http://test']]
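
For reference, here is how the pieces might fit together in one script. This is a sketch under a few assumptions rather than the exact original: it is written for Python 3 (html.parser and super() instead of the old-style HTMLParser module), and the test data is a modified version of the question's sample, with a closed target link plus one extra link before and one after the comment (http://before and http://later are invented) to show that only the first link after the comment is kept.

```python
from html.parser import HTMLParser

# Modified sample data: the "before" and "later" links are invented.
TESTDATA = """
<html>
<body>
<div>
<a href="http://before">skipped</a>
<ul>
<!-- /topOfPage -->
<tr>
<td class="image">
<a href="http://test" rel="nofollow"></a>
<a href="http://later">also skipped</a>
</ul>
</div>
</body>
</html>
"""


class LinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_linktag = False
        self.first_link_after_comment = False
        self.url_cache = []

    def handle_comment(self, data):
        if data.strip() == "/topOfPage":
            self.first_link_after_comment = True

    def handle_starttag(self, tag, attrs):
        if not self.first_link_after_comment:
            return  # comment not seen yet, or target link already done
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:  # found link
            self.in_linktag = True
            self.url_cache.append([attrs["href"]])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:
            self.in_linktag = False
            self.first_link_after_comment = False  # stop after the first link

    def handle_data(self, data):
        if self.in_linktag and data.strip():
            self.url_cache[-1].append(data)


finder = LinkFinder()
finder.feed(TESTDATA)
finder.close()
print(finder.url_cache)  # [['http://test']]
```

Note that the flag is only reset in handle_endtag, so this relies on the target link being closed with </a>; with an unclosed link, any later links would be collected as well.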
