zahlen

Reputation: 73

Parsing HTML page in web crawler

I am trying to build a web crawler that crawls all the links on the page and adds them to a file.

My Python code contains a method that does the following:

  1. Opens a given web page (using the urllib2 module)

  2. Checks whether the HTTP Content-Type response header contains text/html

  3. Decodes the raw HTML response and stores it in the html_string variable.

  4. It then creates an instance of the LinkFinder class, which takes the base URL (Spider.base_url) and the page URL (page_url) as arguments. LinkFinder is defined in another module, link_finder.py.

  5. html_string is then fed to the parser using its feed method.

The LinkFinder class is explained in detail below.

def gather_links(page_url):     # page_url must be an absolute url (it is passed to urlopen)
    html_string = ''
    try:
        req = urllib2.urlopen(page_url)
        head = urllib2.Request(page_url)
        if 'text/html' in head.get_header('Content-Type'):
            html_bytes = req.read()
            html_string = html_bytes.decode("utf-8")
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
    except Exception as e:
        print "Exception " + str(e)
        return set()
    return finder.page_links()

The link_finder.py module uses the standard Python HTMLParser and urlparse modules. The LinkFinder class inherits from HTMLParser and overrides the handle_starttag method to collect every a tag that has an href attribute and add the resolved URLs to a set (self.links).

from HTMLParser import HTMLParser
import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):      # page_url is kept for reference; links are resolved against base_url
        HTMLParser.__init__(self)                # HTMLParser is an old-style class in Python 2, so super() would fail here
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()
    def handle_starttag(self, tag, attrs):       # override the default handler method
        if tag == 'a':                           # compare against the string 'a', not a bare name
            for (key, value) in attrs:
                if key == 'href':
                    url = urlparse.urljoin(self.base_url, value)    # resolve relative hrefs into absolute urls
                    self.links.add(url)
    def error(self, message):
        pass
    def page_links(self):        # return the set of collected links
        return self.links
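
For context, the parser can be exercised in isolation like this (Python 2; the URLs and markup here are just placeholders):

from link_finder import LinkFinder

finder = LinkFinder('http://example.com/', 'http://example.com/page.html')
finder.feed('<a href="/about">About</a> <a href="contact.html">Contact</a>')
print finder.page_links()   # set(['http://example.com/about', 'http://example.com/contact.html'])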

I am getting an exception

argument of type 'NoneType' is not iterable

I think the problem is in the way I used urllib2's Request to check the header content. I am a bit new to this, so some explanation would be appreciated.

Upvotes: 0

Views: 1176

Answers (1)

user6399774

Reputation: 116

I'd have used BeautifulSoup instead of HTMLParser, like so:

from bs4 import BeautifulSoup

soup = BeautifulSoup(pageContent, 'html.parser')   # pageContent is the raw HTML string
links = soup.find_all('a')                         # all <a> tags in the page
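
For completeness, a minimal sketch of what the whole gather_links step could look like with BeautifulSoup (Python 2, bs4 installed; the function name and variables are illustrative). Note that the Content-Type check reads the headers of the response returned by urlopen: calling get_header on a freshly built urllib2.Request returns None, because a Request carries no response headers, and 'text/html' in None is what raises the NoneType error in the question.

import urllib2
import urlparse
from bs4 import BeautifulSoup

def gather_links(base_url, page_url):
    links = set()
    try:
        response = urllib2.urlopen(page_url)
        # Read Content-Type from the response headers, not from a fresh Request
        content_type = response.info().getheader('Content-Type') or ''
        if 'text/html' in content_type:
            soup = BeautifulSoup(response.read(), 'html.parser')
            for a in soup.find_all('a', href=True):     # only <a> tags that carry an href
                links.add(urlparse.urljoin(base_url, a['href']))
    except Exception as e:
        print "Exception " + str(e)
    return links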

Upvotes: 0
