Reputation: 2509

Python and HTMLParser.handle_data() - How to get data from tags?

I'm trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I'm not sure how to do it. This is the code I have so far:

import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered   some data:", data)


url = "website"
page = urllib.request.urlopen(url).read()

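parser = MyHTMLParser()  # the strict argument was removed in Python 3.5
# Decode the bytes first (UTF-8 is assumed here); str(page) would give "b'...'"
parser.feed(page.decode("utf-8"))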

If I understand correctly, I can override the handle_data() method to get the data between tags. How do I specify which tags to get the data from? And how do I then get at that data?

Upvotes: 4

Answers (3)

hwang

Reputation: 46

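from html.parser import HTMLParser

class HTMLParse(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recordh2 = False  # True while the parser is inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.recordh2 = True

    def handle_endtag(self, tag):  # handle_endtag receives only the tag name
        if tag == 'h2':
            self.recordh2 = False

    def handle_data(self, data):
        if self.recordh2:
            print(data)  # do your work here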
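
As a rough sketch of how this flag approach might be used end to end, the variant below collects the text of each h2 into a list instead of printing it. The class name H2Collector, the headings attribute and the sample HTML string are made up for illustration, and it assumes Python 3's html.parser:

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the text found inside <h2> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # True while the parser is inside an <h2>
        self.headings = []   # one string per <h2> element

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")  # start a new, empty heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # handle_data can fire more than once per element (for example
        # around nested tags), so append to the current heading.
        if self.in_h2:
            self.headings[-1] += data


# Sample markup invented for this example.
sample = "<h1>Title</h1><h2>First</h2><p>text</p><h2>Second <b>bold</b></h2>"

collector = H2Collector()
collector.feed(sample)
collector.close()
print(collector.headings)
```

Keeping the results on the parser instance means the caller can read them after feed() returns, which answers the "how do I get the data" part.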

Upvotes: 1

Yanan

Reputation: 24

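import urllib.request

html_code = urllib.request.urlopen("xxx")
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # Lines arrive as bytes in Python 3; UTF-8 is an assumption about the page
    line = line.decode("utf-8", errors="replace").strip()

    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)
hp.close()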

This way you can extract the data from h2 tags, as long as each h2 element starts at the beginning of a line and fits on that single line. Hope it helps.
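
To try the line-filtering idea without a network request, here is a small sketch that runs it on an in-memory string. The TextCollector class and the sample page are invented for this example and are not part of the question's code:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Stores every text chunk the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Sample page invented for this example.
page = """<html><body>
<h2>First heading</h2>
<p>Some paragraph</p>
  <h2 class="x">Second heading</h2>
</body></html>"""

# Keep only the lines that begin with an <h2 tag.
data = ""
for line in page.splitlines():
    line = line.strip()
    if line.startswith("<h2"):
        data += line

hp = TextCollector()
hp.feed(data)
hp.close()
print(hp.chunks)
```

Because the filter works on raw lines, an h2 that spans several lines or follows other markup on the same line would be missed.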

Upvotes: 0

user393899

Reputation: 53

I don't have time to format or clean this up, but this is how I usually do it:

import os
from html.parser import HTMLParser
from urllib.parse import urlparse

class HTMLParse(HTMLParser):
    def handle_starttag(self, tag, attr):
        if tag.lower() == "a":
            for item in attr:
                # each item is a (name, value) pair; value can be None
                if item[0].lower() == "href" and item[1]:
                    path = urlparse(item[1]).path
                    ext = os.path.splitext(path)[1]
                    if ext.lower() in (".jpeg", ".jpg", ".png",
                                       ".bmp"):
                        print("Found: " + item[1])
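
If you would rather keep the matches than print them, a minimal sketch along the same lines could store them on the parser. ImageLinkParser, IMAGE_EXTENSIONS, the found attribute and the sample markup are hypothetical names chosen for this example:

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Extensions treated as images in this sketch.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png", ".bmp")


class ImageLinkParser(HTMLParser):
    """Collects href values of <a> tags that point at image files."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Look only at the path so a query string does not
                # hide the file extension.
                ext = os.path.splitext(urlparse(value).path)[1]
                if ext.lower() in IMAGE_EXTENSIONS:
                    self.found.append(value)


# Sample markup invented for this example.
sample = (
    '<a href="/img/cat.PNG?size=big">cat</a>'
    '<a href="page.html">page</a>'
    '<a HREF="http://example.com/dog.jpg">dog</a>'
)

parser = ImageLinkParser()
parser.feed(sample)
parser.close()
print(parser.found)
```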

Upvotes: 0
