Reputation: 539
I have a simple Python crawler / spider that searches for specified text on a site that I provide. On some sites it crawls normally for 2-4 seconds until an error occurs.
The code so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import requests, pyquery, urlparse

try:
    range = xrange
except NameError:
    pass


def crawl(seed, depth, terms):
    crawled = set()
    uris = set([seed])

    for level in range(depth):
        new_uris = set()
        for uri in uris:
            if uri in crawled:
                continue
            crawled.add(uri)

            # Get URI contents
            try:
                content = requests.get(uri).content
            except:
                continue

            # Look for the terms
            found = 0
            for term in terms:
                if term in content:
                    found += 1
            if found > 0:
                yield (uri, found, level + 1)

            # Find child URIs, and add them to the new_uris set
            dom = pyquery.PyQuery(content)
            for anchor in dom('a'):
                try:
                    link = anchor.attrib['href']
                except KeyError:
                    continue
                new_uri = urlparse.urljoin(uri, link)
                new_uris.add(new_uri)

        uris = new_uris


if __name__ == '__main__':
    import sys

    if len(sys.argv) < 4:
        print('usage: ' + sys.argv[0] +
              "start_url crawl_depth term1 [term2 [...]]")
        print(' ' + sys.argv[0] +
              " http://yahoo.com 5 cute 'fluffy kitties'")
        raise SystemExit

    seed_uri = sys.argv[1]
    crawl_depth = int(sys.argv[2])
    search_terms = sys.argv[3:]

    for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
        print(uri)
Now let's say that I want to find all the pages that have "requireLazy(" in their source. Let's try it with Facebook; if I execute this:
python crawler.py https://www.facebook.com 4 '<script>requireLazy('
It will run fine for 2-4 seconds, and then this error occurs:
https://www.facebook.com
https://www.facebook.com/badges/?ref=pf
https://www.facebook.com/appcenter/category/music/?ref=pf
https://www.facebook.com/legal/terms
https://www.facebook.com/
...
Traceback (most recent call last):
File "crawler.py", line 61, in <module>
for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
File "crawler.py", line 38, in crawl
dom = pyquery.PyQuery(content)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 226, in __init__
elements = fromstring(context, self.parser)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 70, in fromstring
result = getattr(lxml.html, meth)(context)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827) lxml.etree.XMLSyntaxError: line 21: Tag fb:like invalid
Can anyone help me fix this error? Thanks.
Upvotes: 2
Views: 499
Reputation: 1171
It seems that the page content you are trying to parse has some invalid tags. Normally the best you can do is catch and log these kinds of errors and gracefully advance to the next pages.
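For example, here is a minimal sketch of that idea applied to the parsing step inside crawl()'s inner loop (the use of the standard logging module is my assumption; any logger or print would do):

import logging

            # Find child URIs, and add them to the new_uris set
            try:
                dom = pyquery.PyQuery(content)
            except Exception as exc:
                # lxml raises XMLSyntaxError (and similar) for markup it cannot parse;
                # log the page and move on to the next URI instead of crashing
                logging.warning('skipping %s: %s', uri, exc)
                continue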
Ideally, you could use BeautifulSoup to extract the URLs of the pages to be crawled next; it handles most bad content gracefully. You can find more details about BeautifulSoup and how to use it in its documentation.
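As a rough sketch, the link-extraction part of crawl() could look like this with BeautifulSoup instead of pyquery (this assumes BeautifulSoup 4, i.e. the bs4 package, and the built-in html.parser backend):

from bs4 import BeautifulSoup

            # Find child URIs, and add them to the new_uris set
            soup = BeautifulSoup(content, 'html.parser')
            for anchor in soup.find_all('a', href=True):
                new_uri = urlparse.urljoin(uri, anchor['href'])
                new_uris.add(new_uri)

Note that find_all('a', href=True) only returns anchors that actually carry an href attribute, so the KeyError handling from the original loop is no longer needed.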
Actually, after playing around with the crawler, it seems that at some point the page content is empty, so the parser fails to load the document.
I tested the crawler with BeautifulSoup and it works properly. If you want, I can share my updated version with you.
You can easily add a check for empty content, but I'm not sure what other edge cases you might encounter, so switching to BeautifulSoup seems like the safer approach.
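A minimal sketch of that check, placed just before the parsing step in crawl() (treating whitespace-only responses the same as empty ones is my assumption):

            # Skip pages with no usable content before handing them to the parser
            if not content or not content.strip():
                continue
            dom = pyquery.PyQuery(content)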
Upvotes: 1