Arkan Kalu

Reputation: 403

Limit how many elements Scrapy can collect

I am using Scrapy to collect some data. My Scrapy program collects 100 elements in one session. I need to limit it to 50 or any arbitrary number. How can I do that? Any solution is welcome. Thanks in advance.

# -*- coding: utf-8 -*-
import re
import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    start_urls = [
        "http://raleigh.craigslist.org/search/bab"
    ]

    BASE_URL = 'http://raleigh.craigslist.org/'

    def parse(self, response):
        # Follow every posting link found on the search results page.
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        # Extract the posting id from the URL and build the reply-page URL
        # that exposes the anonymized contact e-mail.
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/ral/bab/" + item_id

            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            item["tag"] = response.xpath("//p[@class='attrgroup']/span/b/text()").extract()[0]
            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        # Attach the e-mail address from the reply page to the item
        # passed along via meta.
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item

Upvotes: 2

Views: 1217

Answers (2)

Computer's Guy

Reputation: 5363

I tried alecxe's answer, but I had to combine all three limits to make it work, so I'm leaving it here in case someone else runs into the same issue:

class GenericWebsiteSpider(scrapy.Spider):
    """This generic website spider extracts text from websites"""

    name = "generic_website"

    custom_settings = {
        # Close the spider after 15 pages (responses) have been crawled.
        'CLOSESPIDER_PAGECOUNT': 15,
        # Cap how many requests are processed in parallel (default is 16),
        # so fewer responses are still in flight when the spider closes.
        'CONCURRENT_REQUESTS': 15,
        # Close the spider after 15 items have passed the item pipeline.
        'CLOSESPIDER_ITEMCOUNT': 15
    }
...
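One reason all three matter: the CLOSESPIDER_* limits shut the spider down gracefully, so requests already in flight are still downloaded and processed, and the final counts can overshoot the limit; capping CONCURRENT_REQUESTS keeps that overshoot small. The same values can also be passed per run on the command line, e.g. scrapy crawl generic_website -s CLOSESPIDER_ITEMCOUNT=15. For running from a script, a minimal sketch using Scrapy's CrawlerProcess (assuming the GenericWebsiteSpider above is importable; settings given here act at project precedence, so a spider's own custom_settings would override them):

from scrapy.crawler import CrawlerProcess

# Process-wide settings; per-spider custom_settings take precedence.
process = CrawlerProcess(settings={
    'CLOSESPIDER_PAGECOUNT': 15,
    'CONCURRENT_REQUESTS': 15,
    'CLOSESPIDER_ITEMCOUNT': 15,
})
process.crawl(GenericWebsiteSpider)
process.start()  # blocks until the crawl finishes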

Upvotes: 0

alecxe

Reputation: 474191

This is what the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting were made for:

An integer which specifies a number of items. If the spider scrapes more than that amount of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won’t be closed by number of passed items.
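Applied to the spider from the question, a minimal sketch (50 is the limit the question mentions; the CloseSpider extension that reads this setting is enabled by default, and because requests already in flight are still processed, the final count can end up slightly above the limit):

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["raleigh.craigslist.org"]
    start_urls = ["http://raleigh.craigslist.org/search/bab"]

    # Close the spider once 50 items have passed through the item pipeline.
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 50,
    }

    # ... parse / parse_attr / parse_contact as in the question ...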

Upvotes: 2
