HTML Parser handle_starttag()

Question

I am trying to collect all of the absolute links on a page into a list called https. However, when I run my code and return https, I get an empty list. Could someone help me?

def getWebInfo(url):
    infile=urlopen(url)
    content=infile.read().decode()
    infile.close()
    https=[]

    def handle_starttag(tag, attrs):
        if tag.lower() == 'a':
             for attr in attrs:
                 if attr[0]=='href':
                     absolute=urljoin(url, attr[1])
                     if absolute[:7]=='http://':
                         https.append(absolute)
    parser=HTMLParser()
    parser.feed(content)

    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https

getWebInfo('https://python.readthedocs.io/en/v2.7.2/library/htmlparser.html')

returns:

ALL ABSOLUTE LINKS ON THE WEB PAGE

[]

I want to be able to pass in any URL and get back the absolute links found on that webpage. I'd rather not use BeautifulSoup. Can anyone help me?

EDIT: I called handle_starttag directly within my code, and now I get this error:

if attr[0] == 'href': TypeError: 'HTTPResponse' object does not support indexing

snakecharmerb · Accepted Answer

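The HTMLParser class isn't designed to be used out of the box. Its handler methods, such as handle_starttag, do nothing by default, and the parser never calls a standalone function that merely shares the name, which is why your list stays empty. The idea is that you make your own class that inherits from HTMLParser and override the methods that you want to use. In practice this means adding your 'handle_starttag' function to a class, like this: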

class MyParser(HTMLParser):   # <- new class is a subclass of HTMLParser

    def handle_starttag(self, tag, attrs):  # <- methods need a self argument
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    absolute = urljoin(url, attr[1])
                    if absolute[:7] == 'http://':
                        https.append(absolute)

There's a problem with handle_starttag though: now that it's inside a class, the names https and url are not defined. You can fix this by making them attributes of your parser after you've created it, like this:

parser = MyParser()
parser.https = https
parser.url = url

and prefix them in the handle_starttag method with self., so that the Python interpreter looks for these attributes on your parser. So your code should end up looking like this:

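from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class MyParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    # resolve relative hrefs against the page's own URL
                    absolute = urljoin(self.url, attr[1])
                    # keep web links only (http and https); this skips
                    # schemes such as mailto: and javascript:
                    if absolute.startswith(('http://', 'https://')):
                        self.https.append(absolute)


def getWebInfo(url):
    infile = urlopen(url)
    content = infile.read().decode()
    infile.close()
    https = []

    parser = MyParser()
    parser.https = https   # the parser appends to this same list
    parser.url = url
    parser.feed(content)   # feed() triggers the handle_starttag calls

    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https


links = getWebInfo('https://docs.python.org/3/library/html.parser.html')

for link in links:
    print(link)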
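
As a variation on the code above, the list and the base URL could be set up in the parser's constructor instead of being attached after the parser is created. This is only a sketch: the class name LinkCollector, the attribute names base_url and links, and the sample HTML string are made up for illustration and are not part of the original code. It feeds a literal string so that it runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self, base_url):
        super().__init__()          # let HTMLParser set up its own state
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == 'href' and value is not None:
                absolute = urljoin(self.base_url, value)
                if absolute.startswith(('http://', 'https://')):
                    self.links.append(absolute)


# Made-up sample page: one relative link and one mailto: link
sample_html = (
    '<p><a href="/about">About</a> '
    '<a href="mailto:me@example.com">Mail</a></p>'
)
collector = LinkCollector('https://example.com/docs/')
collector.feed(sample_html)
print(collector.links)  # ['https://example.com/about']
```

Keeping the state inside the class means each parser instance is ready to use as soon as it is created, at the cost of a slightly longer class definition.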

An alternative implementation of handle_starttag, using more modern Python features (the := operator requires Python 3.8 or later), might look like this:

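class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names, so a plain comparison
        # is enough (tag in 'Aa' would also match an empty string)
        if tag == 'a':
            self.https.extend(
                [
                    url
                    for (name, value) in attrs
                    if name == 'href'
                    and (url := urljoin(self.url, value))
                    # keep web links only (http and https)
                    and url.startswith(('http://', 'https://'))
                ]
            )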
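
Both versions rely on urljoin to turn whatever appears in an href into an absolute URL before checking the scheme. A small sketch of how it resolves different kinds of href values may help; the base URL and the href values here are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration
base = 'https://example.com/docs/index.html'

relative = urljoin(base, 'guide.html')              # relative to the page's folder
rooted = urljoin(base, '/about')                    # relative to the site root
external = urljoin(base, 'http://other.example/')   # already absolute, unchanged
mail = urljoin(base, 'mailto:me@example.com')       # non-web scheme, unchanged

print(relative)  # https://example.com/docs/guide.html
print(rooted)    # https://example.com/about
print(external)  # http://other.example/
print(mail)      # mailto:me@example.com
```

Because values such as mailto: links pass through urljoin unchanged, the scheme check afterwards is what keeps them out of the result list.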
