Clement Lombard

Reputation: 185

Scrapy mixing up item fields

I am trying to scrape restaurant reviews from TripAdvisor using Scrapy. The reviews for a single restaurant are spread across several web pages (pagination). I scrape the reviews and then save the result in a JSON file or in MongoDB.

The problem is that when I check the scraped items in the console, the reviews are mixed up: e.g., restaurant A has its own reviews plus some reviews of restaurant B, and restaurant B is missing those reviews.

I tried changing MAX_CONCURRENT_REQUESTS in the settings, but it did not affect the result.

Here is the code in spider.py:

import scrapy

# Place, get_start_urls, get_content, get_page_reviews and
# create_pagination_urls are project helpers not shown here.


class TripAdvisorItemSpider(scrapy.Spider):
    name = 'tripadvisor'

    custom_settings = {
        'COLLECTION_NAME': 'tripadvisor'
    }

    def __init__(self, depth="1", *args, **kwargs):
        super(TripAdvisorItemSpider, self).__init__(*args, **kwargs)
        self.start_urls = get_start_urls()
        self.depth = int(depth)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'item': Place.Place()})

    def parse_review_page(self, response):
        # Append the current page's reviews to those from the previous pages
        item = response.meta['item']
        item['reviews'] += get_page_reviews(response)

        if len(self.urls) > 0:
            yield scrapy.Request(url=self.urls.pop(0), callback=self.parse_review_page, meta={'item': item})
        else:
            yield item

    def parse(self, response):
        if self.depth > 1:
            self.urls = create_pagination_urls(response.request.url, self.depth)
        item = response.meta['item']
        item['place'] = response.css("h1::text").extract_first()
        item['content'] = get_content(response)
        item['reviews'] = get_page_reviews(response)
        if self.depth > 1:
            yield scrapy.Request(url=self.urls.pop(0), callback=self.parse_review_page, meta={'item': item})
        else:
            yield item

I am stuck on this problem. It must have something to do with the lifetime of the request object, but I can't figure out what I did wrong.

Thanks for the help.

Upvotes: 0

Views: 275

Answers (1)

Clement Lombard

Reputation: 185

I found the answer.

Even with MAX_CONCURRENT_REQUESTS = 1, the requests are sent asynchronously and not necessarily in the order in which they are yielded!

As a result, self.urls was overwritten between two pagination requests, replacing the remaining pages of one restaurant with pages from another restaurant.
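The effect can be reproduced without Scrapy. The following is only a minimal simulation, not the real spider: the names SharedStateCrawler and crawl, and the page labels, are made up for illustration, and the processing order (every start page handled before any pagination page) is an assumed interleaving.

```python
from collections import deque


class SharedStateCrawler:
    """Mimics the spider in the question: one `urls` list shared on the instance."""

    def __init__(self):
        self.urls = []
        self.results = {}

    def parse(self, name):
        # Overwrites whatever another restaurant had left in self.urls.
        self.urls = [f"{name}-page2", f"{name}-page3"]
        return (self.urls.pop(0), [f"{name}-page1"], name)

    def parse_review_page(self, page, seen, name):
        seen = seen + [page]
        if self.urls:
            # May pop a page that belongs to a different restaurant.
            return (self.urls.pop(0), seen, name)
        self.results[name] = seen
        return None


def crawl(crawler, names):
    # Assumed interleaving: all start pages are parsed first,
    # then the pending pagination pages are handled in FIFO order.
    pending = deque(crawler.parse(name) for name in names)
    while pending:
        follow_up = crawler.parse_review_page(*pending.popleft())
        if follow_up is not None:
            pending.append(follow_up)
    return crawler.results
```

With two restaurants, crawl(SharedStateCrawler(), ["A", "B"]) gives A the pages A-page1, A-page2 and B-page3, while B only gets B-page1 and B-page2, which matches the symptom described in the question.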

I solved the problem by replacing the spider attribute self.urls, which is shared by all requests, with a regular variable that I pass from one request to the next through meta.

Lessons of the day:

  • Keep in mind that Scrapy requests are asynchronous, even in simple cases
  • Be careful with attributes stored on the spider: they are shared by every request and callback

Upvotes: 1
