Reputation: 2509
I'm trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I'm not sure how to do it. This is the code I have so far:
import urllib.request
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)
url = "website"
page = urllib.request.urlopen(url).read()
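parser = MyHTMLParser()  # the strict argument was removed in Python 3.5
# decode the bytes (assuming UTF-8); str(page) would feed the "b'...'" repr instead
parser.feed(page.decode("utf-8", errors="replace"))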
If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?
Upvotes: 4
Views: 10233
Reputation: 46
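class HTMLParse(HTMLParser):  # with `from html.parser import HTMLParser` as in the question
    recordh2 = False  # default, so handle_data works before the first <h2>
    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.recordh2 = True
    def handle_endtag(self, tag):  # end tags carry no attributes
        if tag == 'h2':
            self.recordh2 = False
    def handle_data(self, data):
        if self.recordh2:
            print(data)  # do your work here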
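To make the flag idea above concrete, here is a minimal sketch that collects the text instead of only flagging it. The class name TagTextParser, its attributes, and the sample HTML are made up for illustration; only the HTMLParser callbacks come from the standard library. It assumes you want the text of each matching element as one string, including text inside nested tags.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, target):
        super().__init__()
        self.target = target  # tag name to capture, e.g. "h2"
        self.depth = 0        # greater than 0 while inside the target tag
        self.parts = []       # text fragments of the element being read
        self.results = []     # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # element closed: join the fragments seen since it opened
                self.results.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Made-up sample input; in practice this would be the decoded page.
sample = "<h1>Title</h1><h2>First <b>part</b></h2><p>skip</p><h2>Second</h2>"
parser = TagTextParser("h2")
parser.feed(sample)
parser.close()
print(parser.results)
```

Passing the tag name to the constructor answers the "which tags" part of the question, and reading parser.results after feed() answers the "how do I get the data" part.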
Upvotes: 1
Reputation: 24
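html_code = urllib.request.urlopen("xxx")  # urllib2.urlopen("xxx") on Python 2
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # readlines() yields bytes on Python 3; decode (assuming UTF-8) before comparing with str
    line = line.decode("utf-8", errors="replace").strip()
    if line.startswith("<h2"):
        data = data + line
hp = MyHTMLParser()
hp.feed(data)
hp.close()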
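This way only the lines that start with an h2 tag reach the parser, so you can extract the data from h2 tags. Note that it only works when each h2 element sits on a single line of the page. Hope this helps.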
Upvotes: 0
Reputation: 53
I don't have time to format or clean this up, but this is how I usually do it:
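import os
from urllib.parse import urlparse  # Python 3; on Python 2 this was the urlparse module
class HTMLParse(HTMLParser):  # with `from html.parser import HTMLParser` as in the question
    def handle_starttag(self, tag, attr):
        if tag.lower() == "a":
            for item in attr:
                # each item is a (name, value) tuple; value is None for a bare attribute
                if item[0].lower() == "href" and item[1]:
                    path = urlparse(item[1]).path
                    ext = os.path.splitext(path)[1]
                    if ext.lower() in (".jpeg", ".jpg", ".png", ".bmp"):
                        print("Found: " + item[1])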
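If you want the matches back in your own code rather than printed, the same attribute check can store them in a list. This is only a sketch: ImageLinkParser, IMAGE_EXTENSIONS, the links attribute, and the sample HTML are names I made up for illustration, and it assumes Python 3.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Extensions treated as images (same set as in the answer above).
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp"}


class ImageLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at image files (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without "=..."
            if name == "href" and value:
                # look only at the path, so query strings do not hide the extension
                ext = os.path.splitext(urlparse(value).path)[1].lower()
                if ext in IMAGE_EXTENSIONS:
                    self.links.append(value)


# Made-up sample input for illustration.
sample = (
    '<a href="/img/cat.JPG?size=2">cat</a>'
    '<a href="page.html">page</a>'
    '<a HREF="http://example.com/dog.png">dog</a>'
)
parser = ImageLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

HTMLParser already lowercases tag and attribute names before calling handle_starttag, so the explicit lower() calls are only needed for the file extension.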
Upvotes: 0