Reputation: 73
I am trying to build a web crawler that crawls all the links on the page and adds them to a file.
My Python code contains a method that does the following:
Opens a given web page (the urllib2 module is used)
Checks whether the HTTP Content-Type header contains text/html
Decodes the raw HTML response into a string and stores it in the html_string variable.
The LinkFinder class is explained in detail below.
def gather_links(page_url): #page_url is relative url
    html_string = ''
    try:
        req = urllib2.urlopen(page_url)
        head = urllib2.Request(page_url)
        if 'text/html' in head.get_header('Content-Type'):
            html_bytes = req.read()
            html_string = html_bytes.decode("utf-8")
        finder = LinkFinder(Spider.base_url, page_url)
        finder.feed(html_string)
    except Exception as e:
        print "Exception " + str(e)
        return set()
    return finder.page_links()
The link_finder.py module uses the standard Python HTMLParser and urlparse modules. The LinkFinder class inherits from HTMLParser and overrides the handle_starttag method to find every a tag with an href attribute and add the URLs to a set (self.links).
from HTMLParser import HTMLParser
import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url): #page_url is relative url
        super(LinkFinder, self).__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs): #Override default handler methods
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    url = urlparse.urljoin(self.base_url, value) #Get exact url
                    self.links.add(url)

    def error(self, message):
        pass

    def page_links(self): #return set of links
        return self.links
I am getting an exception:
argument of type 'NoneType' is not iterable
I think the problem is in the way I used the urllib2 Request to check the header content. I am a bit new to this, so some explanation would be appreciated.
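In case it helps, this is a rough sketch of the header check I think I should be doing instead: reading Content-Type from the response's info() object rather than from a fresh Request, since as far as I can tell Request.get_header() only returns headers that were added to the request by hand, which would explain the None. content_type_of is just a placeholder name:

import urllib2

def content_type_of(page_url):
    # Hypothetical helper: read Content-Type from the headers the server
    # actually sent, not from a brand-new Request object (get_header only
    # sees headers that were set on the request itself, so it returns None).
    response = urllib2.urlopen(page_url)
    return response.info().getheader('Content-Type') or ''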
Upvotes: 0
Views: 1176
Reputation: 116
I'd have used BeautifulSoup instead of HTMLParser, like so:
from bs4 import BeautifulSoup

soup = BeautifulSoup(pageContent, 'html.parser')
links = soup.find_all('a')
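A rough sketch of how that could stand in for your LinkFinder class (find_links and page_html are just placeholder names; urlparse.urljoin is used the same way as in your class):

import urlparse
from bs4 import BeautifulSoup

def find_links(base_url, page_html):
    # Collect absolute URLs from every <a href=...> tag on the page.
    soup = BeautifulSoup(page_html, 'html.parser')
    links = set()
    for a in soup.find_all('a', href=True):
        links.add(urlparse.urljoin(base_url, a['href']))
    return links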
Upvotes: 0