Christian Read

Reputation: 143

Insert no. of scraped items using Scrapy

I want to get the total number of scraped items and the date the script ran, and insert them into MySQL. I put the code inside a pipeline, but it seems the insertion happens inside the loop, once per item. Where can I properly put this code so the data is inserted only once, when scraping is done?

Any ideas, please? Here's my code:

import mysql.connector

class GpdealsSpiderPipeline_hm(object):

    # some working code here

    def store_db(self, item):
        self.curr.execute("""insert into status_hm (script_lastrun, new_sale_item, no_item_added, total_item) values (%s, %s, %s, %s)""", (
            'sample output',
            'sample output',
            'sample output',
            'sample output',
        ))
        self.conn.commit()

Error: mysql.connector.errors.IntegrityError: 1062 (23000): Duplicate entry '' for key 'PRIMARY'

So I am probably putting my code in the wrong place. Please help, thank you.

Upvotes: 0

Views: 63

Answers (1)

Tomáš Linhart

Reputation: 10210

A Scrapy pipeline's purpose is to process a single item at a time. However, you can achieve what you want by putting the logic in the close_spider method, which is called exactly once, after the crawl has finished. You can get the total number of scraped items from Scrapy stats under the key item_scraped_count. See the example:

class ExamplePipeline(object):
    def close_spider(self, spider):
        stats = spider.crawler.stats.get_stats()
        print('Total number of scraped items:', stats['item_scraped_count'])

    def process_item(self, item, spider):
        # logic to process the item
        return item
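Applied to the pipeline from the question, a minimal sketch of this approach might look like the following. The connection parameters are placeholders, only the two columns whose values are known here are inserted, and it assumes status_hm has an AUTO_INCREMENT primary key (the duplicate-entry error in the question suggests it currently does not):

import datetime

import mysql.connector


class GpdealsSpiderPipeline_hm(object):
    def open_spider(self, spider):
        # placeholder credentials -- replace with your own
        self.conn = mysql.connector.connect(
            host='localhost', user='user', password='pass', database='db')
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        # per-item processing stays here
        return item

    def close_spider(self, spider):
        # runs exactly once, after the crawl has finished
        stats = spider.crawler.stats.get_stats()
        self.curr.execute(
            "insert into status_hm (script_lastrun, total_item) "
            "values (%s, %s)",
            (datetime.datetime.now(), stats.get('item_scraped_count', 0)))
        self.conn.commit()
        self.conn.close()

Because the insert now runs once per crawl instead of once per item, only a single row is written for each script run.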

To provide complete info: you can also achieve your goal by connecting to the spider_closed signal from a pipeline, an extension, or from the spider itself. See this complete example connecting to the signal from the spider:

import scrapy
from scrapy import signals

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(QuotesSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        stats = spider.crawler.stats.get_stats()
        print('Total number of scraped items:', stats['item_scraped_count'])

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            item = {
                'text': quote.xpath('./*[@itemprop="text"]/text()').extract_first()
            }
            yield item
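For completeness, the same signal connection can also be made from a pipeline. A short sketch under the same assumptions (the class name SignalPipeline is illustrative, not from the original answer):

from scrapy import signals


class SignalPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        # connect this pipeline's handler to the spider_closed signal
        crawler.signals.connect(pipeline.spider_closed,
                                signal=signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        stats = spider.crawler.stats.get_stats()
        print('Total number of scraped items:', stats['item_scraped_count'])

    def process_item(self, item, spider):
        return item

Either way, the handler fires once per crawl, so it is a safe place for the final database insert.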

Upvotes: 1
