Reputation: 1395
I am trying to build a spider that could efficiently scrape text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:
class DownloadSpider(CrawlSpider):
name = 'downloader'
download_path = '/home/MyProjects/crawler'
rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)
def __init__(self, *args, **kwargs):
super(DownloadSpider, self).__init__(*args, **kwargs)
self.urls_file_path = [kwargs.get('urls_file')]
data = open(self.urls_file_path[0], 'r').readlines()
self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
self.start_urls = ['http://' + domain for domain in self.allowed_domains]
def parse_start_url(self, response):
return self.parse_item(response)
def parse_item(self, response):
self.fname = self.download_path + urlparse(response.url).hostname.strip()
open(str(self.fname)+ '.txt', 'a').write(response.url)
open(str(self.fname)+ '.txt', 'a').write('\n')
urls_file is a path to a text file with urls. I have also set the max depth in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT
exception it closes the spider when the total number of scraped pages (regardless for which site) reaches the exception value. However, I need to stop scraping when I have scraped say 20 pages from each url.
I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems that scrapy doesn't go url by url but mixes them up.
Any advice is much appreciated !
Upvotes: 7
Views: 6517
Reputation: 51
To do this you can create your own link extractor class based on SgmlLinkExtractor. It should look something like this:
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class LimitedLinkExtractor(SgmlLinkExtractor):
def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None,
deny_extensions=None, max_pages=20):
self.max_pages=max_pages
SgmlLinkExtractor.__init__(self, allow=allow, deny=deny, allow_domains=allow_domains, deny_domains=deny_domains, restrict_xpaths=restrict_xpaths,
tags=tags, attrs=attrs, canonicalize=canonicalize, unique=unique, process_value=process_value,
deny_extensions=deny_extensions)
def extract_links(self, response):
base_url = None
if self.restrict_xpaths:
sel = Selector(response)
base_url = get_base_url(response)
body = u''.join(f
for x in self.restrict_xpaths
for f in sel.xpath(x).extract()
).encode(response.encoding, errors='xmlcharrefreplace')
else:
body = response.body
links = self._extract_links(body, response.url, response.encoding, base_url)
links = self._process_links(links)
links = links[0:self.max_pages]
return links
The code of this subclass completely based on the code of the class SgmlLinkExtractor. I've just added variable self.max_pages to the class constructor and line which cut the list of links in the end of extract_links method. But you can cut this list in more intelligent way.
Upvotes: 5
Reputation: 474191
I'd make per-class variable, initialize it with stats = defaultdict(int)
and increment self.stats[response.url]
(or may be the key could be a tuple like (website, depth)
in your case) in parse_item
.
This is how I imagine this - should work in theory. Let me know if you need an example.
FYI, you can extract base url and calculate depth with the help of urlparse.urlparse
(see docs).
Upvotes: 4