user1010775
user1010775

Reputation: 371

html parser python

I am trying to parse a website. I am using the HTMLParser module. The problem is i want to parse the first <a href=""> after the comment: <!-- /topOfPage -->, but I don't really know how to do it. So I have found in the documentation that there is an function which is called handle_comment, but I haven't found out how to use it correctly. I have the following:

import HTMLParser

class LinkFinder(HTMLParser.HTMLParser):
def __init__(self, *args, **kwargs):
    # Can't use super() - HTMLParser is an old-style class
    HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
    self.in_linktag = False
    self.url_cache = []
def handle_comment(self,data):
    if data == "topOfPage":
        print data
def handle_starttag(self, tag, attrs):
    if tag == "a" and any("href" == t[0] for t in attrs): # found link
        self.in_linktag = True
        self.url_cache.append([dict(attrs)['href']])
def handle_endtag(self, tag):
    if tag == "a" and self.in_linktag: # ignore '<a name=""...'
        self.in_linktag = False
def handle_data(self, data):
    if self.in_linktag:
        self.url_cache[-1].append(data)
TESTDATA = """
< html>
< body>
< div>
 < ul>
    < !-- /topOfPage --> 
< tr >
    < td class="empty-cell-left">&nbsp;</td>
    < td class="image">


    < a  href="http://test" rel="nofollow">
 < ul>
< /div>
< /body>
 < /html>
"""
def main():
lf = LinkFinder()
lf.feed(TESTDATA)
lf.close()
print lf.url_cache
if __name__ == "__main__":
    main()

How to do it?

Upvotes: 3

Views: 2897

Answers (2)

Giulio Piancastelli
Giulio Piancastelli

Reputation: 15817

You need an additional variable to indicate that the parser has just come past to the comment, so that you can save the reference from the first link after it.

def __init__(self, *args, **kwargs):
    # ...
    self.first_link_after_comment = False

Then, when you encounter the comment, the flag must be switched.

def handle_comment(self, data):
    if data.strip() == '/topOfPage':
        self.first_link_after_comment = True

When you handle an opening tag, you want to be sure to just make it pass by if the parsing has not passed over the comment

def handle_starttag(self, tag, attrs):
    if not self.first_link_after_comment:
        return
    # ...

Conversely, when you handle the closing tag, you want to acknowledge that the mission has been accomplished.

def handle_endtag(self, tag):
    if tag == 'a' and self.in_linktag: # ignore '<a name=""...'
        self.in_linktag = False
        self.first_link_after_comment = False

Finally, when you append data, just make sure that it's not just a string that's empty or contains white space only.

def handle_data(self, data):
    if self.in_linktag and data.strip():
        self.url_cache[-1].append(data)

And here you are.

$ your_script.py
[['http://test']]

Upvotes: 2

Scott A
Scott A

Reputation: 7844

handle_comment returns all of the data between the delimiters.

In this example, the data would actually be " /topOfPage " (note the spaces and /).

You could also do this instead:

def handle_comment(self,data):
    if "topOfPage" in data:
        print data

Upvotes: 1

Related Questions