Reputation: 37
What would example code look like for recording, in the JSON output file, the time when the Scrapy spider/crawler stops (completes) collecting the data? Example code below:
Example CrawlSpider:
from scrapy.http import request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ebaycomp.items import EbayItem


class EbaySpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/sch/49831/i.html?_from=R40&_nkw=chain+and+sprocket+kit&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1',
                  'https://www.ebay.co.uk/sch/177771/i.html?_from=R40&_nkw=motorcycle+air+filter&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1']

    rules = [Rule(LinkExtractor(allow=('.*'),
                                restrict_xpaths=(['//a[@class="s-item__link"][1]',
                                                  '//a[@class="s-item__link"][2]',
                                                  '//a[@class="s-item__link"][3]'])),
                  callback='parse_items', follow=True)]

    def parse_items(self, response):
        scrapedItem = EbayItem()
        scrapedItem['startUrl'] = 'how to properly return the start_url?'
        scrapedItem['productUrl'] = 'how to properly return the 3 product urls?'
        scrapedItem['productTitle'] = response.xpath('//h1/text()').get()
        scrapedItem['productPrice'] = response.xpath('//span[@itemprop="price"]/text()').get()
        scrapedItem['timeClosed'] = 'the time the spider has stopped'
        return scrapedItem
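As a hedged sketch only (not from the original post) of how the placeholder fields might be filled: response.url inside parse_items is the product page being parsed, and the originating listing (start) page can usually be recovered from the Referer header, assuming Scrapy's default referer middleware is left enabled.

def parse_items(self, response):
    scrapedItem = EbayItem()
    # The response being parsed is the product page itself.
    scrapedItem['productUrl'] = response.url
    # Assumption: the listing page that linked here is available as the
    # Referer header set by Scrapy's default referer middleware.
    referer = response.request.headers.get('Referer')
    scrapedItem['startUrl'] = referer.decode() if referer else None
    scrapedItem['productTitle'] = response.xpath('//h1/text()').get()
    scrapedItem['productPrice'] = response.xpath('//span[@itemprop="price"]/text()').get()
    # timeClosed is only known when the spider finishes, so it is better
    # handled in the pipeline (see the pipeline sketch below).
    return scrapedItem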
Here's my JSON pipeline (not sure how to extract the time and feed it into the JSON output):
import json
from itemadapter import ItemAdapter


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('ebay_out.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, productItem, spider):
        line = json.dumps(ItemAdapter(productItem).asdict()) + "\n"
        self.file.write(line)
        return productItem
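One minimal sketch of recording the stop time (an assumption about the desired approach, not from the original post): since the close time is only known when the crawl finishes, close_spider can write a final JSON line with a timestamp rather than stamping every item.

import json
from datetime import datetime, timezone
from itemadapter import ItemAdapter


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('ebay_out.jl', 'w')

    def close_spider(self, spider):
        # Scrapy calls this when the spider stops; record that moment
        # as a final JSON line before closing the file.
        closing_record = {'timeClosed': datetime.now(timezone.utc).isoformat()}
        self.file.write(json.dumps(closing_record) + "\n")
        self.file.close()

    def process_item(self, productItem, spider):
        line = json.dumps(ItemAdapter(productItem).asdict()) + "\n"
        self.file.write(line)
        return productItem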
Upvotes: 0
Views: 608
Reputation: 4822
Please read the documentation again, specifically the part about crawlers.
If you want to limit your CrawlSpider, you can set:
custom_settings = {
    'CLOSESPIDER_PAGECOUNT': 10,  # whatever number you want
    'CONCURRENT_REQUESTS': 1,
}
# also you can just use restrict_xpaths='//a[@class="s-item__link"]' and lose the rest.
(More information here).
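Putting the answer's suggestions together, a rough sketch of the simplified spider might look like this (the page count of 50 is an arbitrary placeholder, and start_urls are the same as in the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EbaySpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['ebay.co.uk']
    start_urls = ['https://www.ebay.co.uk/sch/49831/i.html?_from=R40&_nkw=chain+and+sprocket+kit&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1',
                  'https://www.ebay.co.uk/sch/177771/i.html?_from=R40&_nkw=motorcycle+air+filter&LH_ItemCondition=1000&rt=nc&LH_PrefLoc=1']

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 50,  # arbitrary placeholder limit
        'CONCURRENT_REQUESTS': 1,
    }

    # A single restrict_xpaths expression matches every product link,
    # so the three positional [1]/[2]/[3] variants are not needed.
    rules = [Rule(LinkExtractor(restrict_xpaths='//a[@class="s-item__link"]'),
                  callback='parse_items', follow=True)]

    def parse_items(self, response):
        # Same parsing logic as in the question, trimmed for brevity.
        yield {'productTitle': response.xpath('//h1/text()').get()}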
Upvotes: 1