George

Reputation: 37

How do I write to a JSON file the time when a Scrapy spider has finished scraping?

I need an example of how to record, in the JSON output file, the time when the Scrapy spider/crawler stops (completes) collecting the data. My current code is below.

Example CrawlSpider:

from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ebaycomp.items import EbayItem


class EbaySpider(CrawlSpider):

    name = 'spider'
    allowed_domains = ['ebay.co.uk']

    start_urls = ['https://www.ebay.co.uk/sch/49831/i.html?_from=R40&_nkw=chain+and+sprocket+kit&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1',
                  'https://www.ebay.co.uk/sch/177771/i.html?_from=R40&_nkw=motorcycle+air+filter&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1']

    rules = [Rule(LinkExtractor(allow=('.*'),
                                restrict_xpaths=(['//a[@class="s-item__link"][1]',
                                                  '//a[@class="s-item__link"][2]',
                                                  '//a[@class="s-item__link"][3]'
                                                  ])), callback='parse_items', follow=True)]

    def parse_items(self, response):

        scrapedItem = EbayItem()
        scrapedItem['startUrl'] = 'how to properly return the start_url?'
        scrapedItem['productUrl'] = 'how to properly return the 3 product urls?'
        scrapedItem['productTitle'] = response.xpath('//h1/text()').get()
        scrapedItem['productPrice'] = response.xpath('//span[@itemprop="price"]/text()').get()
        scrapedItem['timeClosed'] = 'the time the spider has stopped'

        return scrapedItem

Here's my JSON pipeline (I'm not sure how to extract the time and feed it into the JSON output):

import json

from itemadapter import ItemAdapter


class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('ebay_out.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, productItem, spider):
        line = json.dumps(ItemAdapter(productItem).asdict()) + "\n"
        self.file.write(line)
        return productItem
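For reference, `close_spider` is the natural hook for this: it runs once the spider has stopped, so the pipeline can append a final record with the finish time before closing the file. A minimal, stdlib-only sketch (plain dicts instead of `ItemAdapter`, so it has no extra dependency; the `timeClosed` key matches the item field above):

```python
import json
from datetime import datetime, timezone


class JsonWriterPipeline:
    """JSON-lines pipeline that appends a final record with the finish time."""

    def open_spider(self, spider):
        self.file = open('ebay_out.jl', 'w')

    def close_spider(self, spider):
        # close_spider is called when the spider stops, so this timestamp
        # is, to within shutdown latency, the time the crawl finished.
        finished = datetime.now(timezone.utc).isoformat()
        self.file.write(json.dumps({'timeClosed': finished}) + "\n")
        self.file.close()

    def process_item(self, productItem, spider):
        # The original uses ItemAdapter(productItem).asdict(); a plain dict
        # is used here only to keep the sketch self-contained.
        self.file.write(json.dumps(dict(productItem)) + "\n")
        return productItem
```

With this, the last line of ebay_out.jl is a JSON object holding the finish time, and every preceding line is one scraped item.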

Upvotes: 0

Views: 608

Answers (1)

SuperUser

Reputation: 4822

Please read the documentation again, specifically the part about crawl spiders.

  1. Scrapy will follow your rules for each URL in start_urls.
  2. By the time you reach parse_items you're already on the product page, so if you want its URL it's just 'response.url'.

If you want to limit your CrawlSpider then you can set:

custom_settings = {
    'CLOSESPIDER_PAGECOUNT': 10,  # whatever number you want
    'CONCURRENT_REQUESTS': 1,
}
# You can also just keep restrict_xpaths='//a[@class="s-item__link"]' and drop the rest.

(More information in the Scrapy documentation on the CloseSpider extension.)

  1. If you want JSON, note that your pipeline writes JSON lines ('.jl'), not a regular JSON file.
  2. There's an example in the Scrapy documentation of writing items to a JSON file.
  3. In this case I think it's better to use scrapy.Spider.
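As a sketch of what that documented JSON export boils down to (assuming Scrapy >= 2.1, where the FEEDS setting is available), the custom pipeline can be replaced entirely by a feed export declared on the spider:

```python
# Sketch, assuming Scrapy >= 2.1 (the FEEDS setting). Put this on the
# spider class instead of writing a JsonWriterPipeline by hand; Scrapy
# then serializes every yielded item to the file itself.
custom_settings = {
    'FEEDS': {
        'ebay_out.jl': {'format': 'jsonlines', 'encoding': 'utf8'},
    },
}
```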

Upvotes: 1
