python_newbie

Reputation: 123

HTML Parser handle_starttag()

I am trying to collect all of the absolute links on a page into a list called https. However, when I run my code and return https, I get an empty list. Could someone help me?

def getWebInfo(url):
    infile=urlopen(url)
    content=infile.read().decode()
    infile.close()
    https=[]

    def handle_starttag(tag, attrs):
        if tag.lower() == 'a':
             for attr in attrs:
                 if attr[0]=='href':
                     absolute=urljoin(url, attr[1])
                     if absolute[:7]=='http://':
                         https.append(absolute)
    parser=HTMLParser()
    parser.feed(content)

    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https

getWebInfo('https://python.readthedocs.io/en/v2.7.2/library/htmlparser.html')

returns:

ALL ABSOLUTE LINKS ON THE WEB PAGE

[]

I want to be able to run the code so that when I input any URL, it returns the absolute links found on that web page. I would rather not use BeautifulSoup. Can anyone help me?

EDIT: I called handle_starttag within my code, and now I get this error:
if attr[0] == 'href': TypeError: 'HTTPResponse' object does not support indexing

Upvotes: 0

Views: 3627

Answers (2)

snakecharmerb

Reputation: 55924

The HTMLParser class isn't designed to be used out of the box. The idea is that you make your own class that inherits from HTMLParser and override the methods that you want to use. In practice this means adding your 'handle_starttag' function to a class, like this:

class MyParser(HTMLParser):   # <- new class is a subclass of HTMLParser

    def handle_starttag(self, tag, attrs):  # <- methods need a self argument
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    absolute = urljoin(url, attr[1])
                    if absolute[:7] == 'http://':
                        https.append(absolute)

There's a problem with handle_starttag though: now that it's inside a class, the names https and url are not defined. You can fix this by making them attributes of your parser after you've created it, like this:

parser = MyParser()
parser.https = https
parser.url = url

and prefix them in the handle_starttag method with self., so that the Python interpreter looks for these attributes on your parser. So your code should end up looking like this:

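from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class MyParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag.lower() == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    absolute = urljoin(self.url, attr[1])
                    # keep http:// and https:// links; the original check
                    # for 'http://' alone would drop every https:// link
                    if absolute.startswith(('http://', 'https://')):
                        self.https.append(absolute)


def getWebInfo(url):
    infile = urlopen(url)
    content = infile.read().decode()
    infile.close()
    https = []

    parser = MyParser()
    parser.https = https  # the list the parser appends to
    parser.url = url      # base URL used to resolve relative links
    parser.feed(content)

    print('ALL ABSOLUTE LINKS ON THE WEB PAGE')
    print('--------------------------------------')
    return https

links = getWebInfo('https://docs.python.org/3/library/html.parser.html')

for link in links:
    print(link)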

An alternative implementation of handle_starttag, using an assignment expression (the := operator, available from Python 3.8), might look like this:

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':  # HTMLParser passes tag names in lowercase
            self.https.extend(
                [
                    url
                    for (name, value) in attrs
                    if name == 'href'
                    and (url := urljoin(self.url, value))
                    and url.startswith('https://')
                ]
            )
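
As a variation on the above, here is a minimal sketch that sets the base URL and the result list in the parser's constructor instead of attaching them after creation. The names LinkParser, extract_links, base_url and links are made up for illustration, and the function takes an HTML string rather than fetching a URL so that it can be tried without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags (illustrative name)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative hrefs
        self.links = []           # filled in by handle_starttag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                absolute = urljoin(self.base_url, value)
                # Skip non-web schemes such as mailto: or javascript:
                if absolute.startswith(('http://', 'https://')):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Hypothetical helper: parse an HTML string and return the links found."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="/docs">Docs</a> '
    '<A HREF="mailto:me@example.com">Mail</A> '
    '<a href="http://example.org/x">X</a>'
)
print(extract_links(sample, 'https://example.com/index.html'))
```

To use it on a live page, you would pass the decoded result of urlopen(url).read() as the html argument and url as base_url.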

Upvotes: 2

tknickman

Reputation: 4641

The problem here is that you are never modifying your https list. You define the handle_starttag function - which appends to the list - but then never call it.
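
A small sketch of the difference, using a made-up list called found and a made-up class called TagRecorder: a plain HTMLParser only calls its own handle_starttag method (which does nothing), so a separate function with the same name is never invoked, while a subclass that overrides the method is.

```python
from html.parser import HTMLParser

found = []


def handle_starttag(tag, attrs):
    # A standalone function: the parser does not know it exists.
    found.append(tag)


plain = HTMLParser()
plain.feed('<a href="/x">x</a>')
found_by_plain = list(found)  # still empty, nothing called the function


class TagRecorder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Overridden method: feed() calls this for every start tag.
        found.append(tag)


TagRecorder().feed('<a href="/x">x</a>')
found_by_subclass = list(found)

print(found_by_plain)
print(found_by_subclass)
```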

Upvotes: 1
