Reputation: 185
I am trying to scrape restaurant reviews from TripAdvisor using Scrapy. The reviews for a single restaurant are spread across several web pages (pagination). I scrape the reviews and then save the result to a JSON file or to MongoDB.
The problem is that when I check the scraped items in the console, the reviews are mixed up: e.g. restaurant A ends up with its own reviews plus some reviews of restaurant B, and restaurant B is missing those reviews.
I tried changing MAX_CONCURRENT_REQUESTS in the settings, but it did not affect the result.
Here is the spider.py code:
class TripAdvisorItemSpider(scrapy.Spider):
    name = 'tripadvisor'
    custom_settings = {
        'COLLECTION_NAME': 'tripadvisor'
    }

    def __init__(self, depth="1", *args, **kwargs):
        super(TripAdvisorItemSpider, self).__init__(*args, **kwargs)
        self.start_urls = get_start_urls()
        self.depth = int(depth)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'item': Place.Place()})

    def parse_review_page(self, response):
        # Append the current page's reviews to those from the previous pages
        item = response.meta['item']
        item['reviews'] += get_page_reviews(response)
        if len(self.urls) > 0:
            yield scrapy.Request(url=self.urls.pop(0), callback=self.parse_review_page, meta={'item': item})
        else:
            yield item

    def parse(self, response):
        if self.depth > 1:
            self.urls = create_pagination_urls(response.request.url, self.depth)
        item = response.meta['item']
        item['place'] = response.css("h1::text").extract_first()
        item['content'] = get_content(response)
        item['reviews'] = get_page_reviews(response)
        if self.depth > 1:
            yield scrapy.Request(url=self.urls.pop(0), callback=self.parse_review_page, meta={'item': item})
        else:
            yield item
I am stuck on this problem. It must have something to do with the request object's lifetime, but I can't figure out what I did wrong.
Thanks for the help.
Upvotes: 0
Views: 275
Reputation: 185
I found the answer.
Even with MAX_CONCURRENT_REQUESTS = 1, the requests are handled asynchronously, and their callbacks do not necessarily run in the order in which the requests were yielded!
As a result, self.urls was being redefined in between two pagination requests for the same restaurant, replacing the correct pages to iterate over with pages from another restaurant.
I solved the problem by replacing the instance attribute self.urls with a regular variable that I pass from one request to the next through meta.
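To make the difference concrete, here is a minimal sketch that reproduces the mix-up and the fix without any network access. It is a plain-Python simulation, not Scrapy itself: `Request`, `crawl`, `page_urls` and both spider classes are hypothetical stand-ins I made up for illustration, the callbacks receive the request directly instead of a response, and each "review" is just the URL of the page it would have come from. The second spider carries the remaining URLs in `meta`, which is the approach described above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for scrapy.Request: URL, callback, per-request meta."""
    url: str
    callback: object
    meta: dict = field(default_factory=dict)


def page_urls(first_url, depth):
    """Hypothetical pagination helper: URLs of pages 2..depth of one restaurant."""
    return [f"{first_url}?page={n}" for n in range(2, depth + 1)]


class SharedStateSpider:
    """Mirrors the question: pending pages live in self.urls, shared by all restaurants."""
    depth = 3  # must be > 1 here, as in the question's pagination branch

    def parse(self, request):
        self.urls = page_urls(request.url, self.depth)  # overwritten by every restaurant
        item = {"place": request.url, "reviews": [request.url]}
        yield Request(self.urls.pop(0), self.parse_review_page, {"item": item})

    def parse_review_page(self, request):
        item = request.meta["item"]
        item["reviews"].append(request.url)
        if self.urls:
            # May pop a page that belongs to another restaurant.
            yield Request(self.urls.pop(0), self.parse_review_page, {"item": item})
        else:
            yield item


class PerRequestStateSpider:
    """The fix: the remaining pages travel with each request in meta."""
    depth = 3

    def parse(self, request):
        item = {"place": request.url, "reviews": [request.url]}
        yield from self.follow(item, page_urls(request.url, self.depth))

    def parse_review_page(self, request):
        item = request.meta["item"]
        item["reviews"].append(request.url)
        yield from self.follow(item, request.meta["urls"])

    def follow(self, item, urls):
        if urls:
            meta = {"item": item, "urls": urls[1:]}
            yield Request(urls[0], self.parse_review_page, meta)
        else:
            yield item


def crawl(spider, start_urls):
    """Tiny FIFO engine, so callbacks of different restaurants interleave."""
    queue = deque(Request(url, spider.parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


def is_clean(item):
    """True if every review page of the item belongs to the item's own restaurant."""
    return all(url.startswith(item["place"]) for url in item["reviews"])


START_URLS = ["https://example.com/restaurant-a", "https://example.com/restaurant-b"]

if __name__ == "__main__":
    for spider in (SharedStateSpider(), PerRequestStateSpider()):
        items = crawl(spider, START_URLS)
        print(type(spider).__name__, [is_clean(item) for item in items])
```

In a real spider the same idea should carry over by yielding `scrapy.Request(..., meta={'item': item, 'urls': remaining})` and reading `response.meta['urls']` in the callback, so that nothing specific to one restaurant is stored on the spider instance.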
Lessons of the day:
Upvotes: 1