Reputation: 123
I am trying to collect all of the absolute links on a page into a list called https. However, when I run my code and return https, I get an empty list. Could someone help me?
def getWebInfo(url):
    infile=urlopen(url)
    content=infile.read().decode()
    infile.close()
    https=[]
    def handle_starttag(tag, attrs):
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0]=='href':
                    absolute=urljoin(url, attr[1])
                    if absolute[:7]=='http://':
                        https.append(absolute)
    parser=HTMLParser()
    parser.feed(content)
    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https
getWebInfo('https://python.readthedocs.io/en/v2.7.2/library/htmlparser.html')
returns:
[]
I want to be able to run the code so that when I input any URL, it returns the absolute links found on that webpage. I would rather not use BeautifulSoup. Can anyone help me?
EDITED
I called handle_starttag within my code, and now I get this error:
if attr[0] == 'href':
TypeError: 'HTTPResponse' object does not support indexing
Upvotes: 0
Views: 3627
Reputation: 55924
The HTMLParser class isn't designed to be used out of the box. The idea is that you make your own class that inherits from HTMLParser and override the methods that you want to use. In practice this means adding your 'handle_starttag' function to a class, like this:
class MyParser(HTMLParser):  # <- new class is a subclass of HTMLParser
    def handle_starttag(self, tag, attrs):  # <- methods need a self argument
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0]=='href':
                    absolute=urljoin(url, attr[1])
                    if absolute[:7]=='http://':
                        https.append(absolute)
There's a problem with handle_starttag though: now that it's inside a class, the names https and url are not defined. You can fix this by making them attributes of your parser after you've created it, like this:
parser = MyParser()
parser.https = https
parser.url = url
and prefix them in the handle_starttag method with self. so that the Python interpreter looks for these attributes on your parser. Your code should end up looking like this:
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0]=='href':
                    # Resolve relative links against the page URL
                    absolute=urljoin(self.url, attr[1])
                    if absolute[:7]=='http://':
                        self.https.append(absolute)

def getWebInfo(url):
    infile=urlopen(url)
    content=infile.read().decode()
    infile.close()
    https=[]
    parser=MyParser()
    # Attach the list and base URL so handle_starttag can reach them via self
    parser.https = https
    parser.url = url
    parser.feed(content)
    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https

links = getWebInfo('https://docs.python.org/3/library/html.parser.html')
for link in links:
    print(link)
An alternative implementation of handle_starttag, using more modern Python features (the := operator requires Python 3.8 or later), might look like this:
class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lower case
        if tag == 'a':
            self.https.extend(
                [
                    url
                    for (name, value) in attrs
                    if name == 'href'
                    and (url := urljoin(self.url, value))
                    and url.startswith('https://')
                ]
            )
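A possible variation, offered only as a sketch: instead of attaching https and url to the parser after creating it, the parser could take the base URL in its constructor and own the list itself. The names LinkParser, base_url, and links are hypothetical, chosen for illustration, and the sample HTML is made up so the example runs without network access. Unlike the code above, this sketch accepts both http:// and https:// links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags (illustrative sketch)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative hrefs
        self.links = []           # filled in as the parser sees <a> tags

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lower case.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                absolute = urljoin(self.base_url, value)
                if absolute.startswith(('http://', 'https://')):
                    self.links.append(absolute)


# Made-up HTML so the sketch runs offline.
sample = '<a href="/about">About</a> <a href="mailto:me@example.com">Mail</a>'
parser = LinkParser('https://example.com/docs/')
parser.feed(sample)
print(parser.links)
```

Keeping the state inside the class means the caller cannot forget to set the attributes before calling feed().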
Upvotes: 2
Reputation: 4641
The problem here is that you are never modifying your https list. You define the handle_starttag function, which appends to the list, but then never call it.
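To illustrate: the base HTMLParser has its own handle_starttag that does nothing, so feed() never reaches a separate function that merely shares the name. A minimal sketch, where the names collected and Collector and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

html = '<a href="http://example.com/">example</a>'
collected = []


def handle_starttag(tag, attrs):
    # A free function: nothing connects it to the parser, so feed() never calls it.
    collected.append(tag)


HTMLParser().feed(html)
print(collected)  # still empty


class Collector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Overriding the method is what makes feed() run this code.
        collected.append(tag)


Collector().feed(html)
print(collected)
```

Only the second feed() call adds anything to the list, because only the subclass replaces the parser's own method.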
Upvotes: 1