Reputation: 371
I am trying to parse a website using the HTMLParser module. I want to extract the first <a href=""> that appears after the comment <!-- /topOfPage -->, but I don't know how to do it. The documentation mentions a method called handle_comment, but I haven't figured out how to use it correctly. This is what I have so far:
import HTMLParser

class LinkFinder(HTMLParser.HTMLParser):
    def __init__(self, *args, **kwargs):
        # Can't use super() - HTMLParser is an old-style class
        HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        if data == "topOfPage":
            print data

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any("href" == t[0] for t in attrs):  # found link
            self.in_linktag = True
            self.url_cache.append([dict(attrs)['href']])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:  # ignore '<a name=""...'
            self.in_linktag = False

    def handle_data(self, data):
        if self.in_linktag:
            self.url_cache[-1].append(data)

TESTDATA = """
<html>
<body>
<div>
<ul>
<!-- /topOfPage -->
<tr>
<td class="empty-cell-left"> </td>
<td class="image">
<a href="http://test" rel="nofollow">
<ul>
</div>
</body>
</html>
"""

def main():
    lf = LinkFinder()
    lf.feed(TESTDATA)
    lf.close()
    print lf.url_cache

if __name__ == "__main__":
    main()

How can I do this?
Upvotes: 3
Views: 2897
Reputation: 15817
You need an additional variable to indicate that the parser has just passed the comment, so that you can save the href of the first link after it.

def __init__(self, *args, **kwargs):
    # ...
    self.first_link_after_comment = False

Then, when the parser encounters the comment, set the flag.

def handle_comment(self, data):
    if data.strip() == '/topOfPage':
        self.first_link_after_comment = True

When you handle an opening tag, return immediately if the parser has not yet passed the comment.

def handle_starttag(self, tag, attrs):
    if not self.first_link_after_comment:
        return
    # ...

Conversely, when you handle the closing tag of that link, clear the flag so that later links are ignored.

def handle_endtag(self, tag):
    if tag == 'a' and self.in_linktag:  # ignore '<a name=""...'
        self.in_linktag = False
        self.first_link_after_comment = False

Finally, when you append data, make sure that it is not an empty string or whitespace only.

def handle_data(self, data):
    if self.in_linktag and data.strip():
        self.url_cache[-1].append(data)

And here you are.
$ your_script.py
[['http://test']]
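
Putting the pieces together, here is a minimal, self-contained sketch of the approach above. It makes a few assumptions of its own: it is written for Python 3 (html.parser and print() rather than the Python 2 HTMLParser module from the question), and the class name FirstLinkAfterComment, the marker parameter and the SAMPLE markup are made up for illustration. Unlike the test data in the question, the sample closes its <a> tags, so the flag really does get cleared after the first link.

```python
from html.parser import HTMLParser


class FirstLinkAfterComment(HTMLParser):
    """Collect only the first <a href> after a marker comment (illustrative name)."""

    def __init__(self, marker="/topOfPage"):
        super().__init__()
        self.marker = marker                    # hypothetical parameter
        self.first_link_after_comment = False   # set once the comment is seen
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        # data includes the surrounding spaces, hence the strip()
        if data.strip() == self.marker:
            self.first_link_after_comment = True

    def handle_starttag(self, tag, attrs):
        if not self.first_link_after_comment:
            return  # the comment has not been passed yet
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.in_linktag = True
            self.url_cache.append([attrs["href"]])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:
            self.in_linktag = False
            self.first_link_after_comment = False  # ignore later links

    def handle_data(self, data):
        if self.in_linktag and data.strip():
            self.url_cache[-1].append(data)


# Made-up sample markup: one link before the comment, two after it.
SAMPLE = """
<html><body>
<a href="http://before">before the comment</a>
<!-- /topOfPage -->
<table><tr><td class="image">
<a href="http://test" rel="nofollow">first link</a>
<a href="http://later">second link</a>
</td></tr></table>
</body></html>
"""

parser = FirstLinkAfterComment()
parser.feed(SAMPLE)
parser.close()
print(parser.url_cache)  # [['http://test', 'first link']]
```

Only the link that directly follows the comment ends up in url_cache; the one before the comment and the second one after it are both skipped.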
Upvotes: 2
Reputation: 7844
handle_comment receives all of the text between the comment delimiters.
In this example, the data would actually be " /topOfPage " (note the spaces and /).
You could also do this instead:
def handle_comment(self, data):
    if "topOfPage" in data:
        print data
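
If you want to see exactly what the parser hands over, a small experiment like the following can help. This is only a sketch and assumes Python 3 (html.parser instead of the Python 2 module used in the question); the CommentCollector class and the one-line input are made up for the demonstration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Record the raw text passed to handle_comment (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<ul><!-- /topOfPage --><li>item</li></ul>")
collector.close()

raw = collector.comments[0]
print(repr(raw))                    # ' /topOfPage '
print(raw == "topOfPage")           # False: the check from the question never matches
print(raw.strip() == "/topOfPage")  # True: exact match after stripping the spaces
print("topOfPage" in raw)           # True: the looser substring test
```

The substring test is more forgiving, but it would also match any other comment that happens to contain topOfPage, so the stripped exact comparison is the safer choice if the page has several similar comments.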
Upvotes: 1