Hojat Taheri

Reputation: 177

Extracting links from a page in Python 3

I want to extract all links in a page. This is my code, but it does nothing: when I print the fetched page it prints fine, but the parsing step doesn't output anything!

from html.parser import HTMLParser
import urllib
import urllib.request


class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if (tag == "a"):
            for a in attrs:
                if (a[0] == "href"):
                    link = a[1]
                    if (link.find('http') >= 1):
                        print(link)
                        newParser = myParser()
                        newParser.feed(link)

url = "http://www.asriran.com"
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
handle = response.read()
parser = myParser()
print (handle)
parser.feed(str(handle))

Upvotes: 0

Views: 706

Answers (1)

Michał Machnicki

Reputation: 2877

Your code doesn't print anything for two reasons:

  • You don't decode the HTTP response. response.read() returns bytes, and str(handle) gives the repr of those bytes (b'...') rather than the decoded text, so the parser is not fed the real page content.
  • link.find('http') >= 1 is never true for links starting with http or https, because find returns index 0 for them. Use link.find('http') == 0 or link.startswith('http') instead.
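
Both points can be checked without any network access. The snippet below is only a small sketch: the sample bytes and the example.com link are made up for illustration and stand in for whatever the real page returns.

```python
# Made-up sample body; urlopen(...).read() returns bytes like this.
raw = b'<a href="http://example.com/">x</a>'

# str() on bytes gives the repr, including the b'...' wrapper.
as_repr = str(raw)
# decode() gives the actual text (assuming the page is UTF-8).
as_text = raw.decode('utf-8')

print(as_repr.startswith("b'"))   # the repr wrapper is part of the string
print(as_text.startswith('<a'))   # the decoded text starts with the tag

link = 'http://example.com/'
# find() returns the index of the match, which is 0 when the link
# starts with 'http', so a ">= 1" test rejects exactly those links.
position = link.find('http')
print(position >= 1)
print(link.startswith('http'))
```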

If you want to stick to HTMLParser, you can modify your code as follows:

from html.parser import HTMLParser
import urllib.request


class myParser(HTMLParser):

    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers don't share results
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                # value is None for a bare attribute such as <a href>
                if name == 'href' and value and value.startswith('http'):
                    print(value)
                    self.links.append(value)


with urllib.request.urlopen("http://www.asriran.com") as response:
    # Assumes the page is UTF-8 encoded
    handle = response.read().decode('utf-8')
parser = myParser()
parser.feed(handle)

http_links = parser.links
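
One thing to watch with a collector-style parser is where the list of links lives. A list defined in the class body is shared by every instance, while a list created in __init__ belongs to one parser only. The sketch below illustrates the difference offline; the class names SharedLinks and OwnLinks and the two sample pages are hypothetical and used only for this demonstration.

```python
from html.parser import HTMLParser


class SharedLinks(HTMLParser):
    links = []  # class attribute: one list shared by every instance

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href' and v)


class OwnLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  # instance attribute: a fresh list per parser

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href' and v)


# Made-up sample pages
page_one = '<a href="http://example.com/one">1</a>'
page_two = '<a href="http://example.com/two">2</a>'

shared_a, shared_b = SharedLinks(), SharedLinks()
shared_a.feed(page_one)
shared_b.feed(page_two)

own_a, own_b = OwnLinks(), OwnLinks()
own_a.feed(page_one)
own_b.feed(page_two)

print(shared_b.links)  # also holds the link collected by shared_a
print(own_b.links)     # holds only the link from page_two
```

If you only ever create one parser, the difference doesn't show up, but per-instance state is the safer default once several pages are parsed in the same program.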

Otherwise, I would suggest switching to Beautiful Soup and parsing the response, for example like this:

from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen("http://www.asriran.com") as response:
    html = response.read().decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')

# href=True skips <a> tags without an href, which would otherwise yield None
all_links = [a['href'] for a in soup.find_all('a', href=True)]
http_links = [link for link in all_links if link.startswith('http')]

Upvotes: 3
