Scrapy crawler not processing XHR Request

Question

My spider only crawls the first 10 pages, so I assume it never gets past the "Load more" button through the Request.

I am scraping this website: http://www.t3.com/reviews.

My spider code:

import scrapy
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem


class T3Spider(scrapy.Spider):
    name = "t3" #spider name to call in terminal
    allowed_domains = ['t3.com'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.t3.com/reviews'] #url from which the spider will start crawling

    def parse(self, response):
        sel = Selector(response)
        review_links = sel.xpath('//div[@id="content"]//div/div/a/@href').extract()
        for link in review_links:
            yield Request(url="http://www.t3.com"+link, callback=self.parse_review)
#if there is a load-more button:
        if sel.xpath('//*[@class="load-more"]'):
            req = Request(url=r'http://www\.t3\.com/more/reviews/latest/\d+', headers = {"Referer": "http://www.t3.com/reviews", "X-Requested-With": "XMLHttpRequest"}, callback=self.parse)
            yield req
        else:
            return

    def parse_review(self, response):
        pass #all my scraped item fields

What am I doing wrong? Sorry, I am quite new to Scrapy. Thanks for your time, patience and help.

alecxe · Accepted Answer

If you inspect the "Load More" button, you will not find any indication of how the link that loads more reviews is constructed. The idea behind it is rather simple: the number after http://www.t3.com/more/reviews/latest/ looks suspiciously like a timestamp of the last loaded article. Here is how you can get it:

import calendar

from dateutil.parser import parse
import scrapy
from scrapy.http import Request


class T3Spider(scrapy.Spider):
    name = "t3"
    allowed_domains = ['t3.com']
    start_urls = ['http://www.t3.com/reviews']

    def parse(self, response):
        reviews = response.css('div.listingResult')
        for review in reviews:
            link = review.xpath("a/@href").extract()[0]
            yield Request(url="http://www.t3.com" + link, callback=self.parse_review)

        # TODO: handle exceptions here

        # extract the review date
        time = reviews[-1].xpath(".//time/@datetime").extract()[0]

        # convert a date into a timestamp
        timestamp = calendar.timegm(parse(time).timetuple())

        url = 'http://www.t3.com/more/reviews/latest/%d' % timestamp
        req = Request(url=url,
                      headers={"Referer": "http://www.t3.com/reviews", "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse)
        yield req

    def parse_review(self, response):
        print(response.url)
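
The date-to-timestamp step can be checked on its own, outside the spider. The following is only a sketch using the standard library instead of dateutil. It assumes the datetime attribute holds an ISO 8601 string, and the helper name to_unix_timestamp is made up for illustration. Unlike timetuple() in the spider, utctimetuple() applies the UTC offset when the string carries one; which of the two the site expects is not confirmed here.

```python
import calendar
from datetime import datetime


def to_unix_timestamp(value):
    """Convert an ISO 8601 datetime string to whole seconds since the epoch (UTC)."""
    # Assumption: the page uses ISO 8601. A trailing "Z" is rewritten as an
    # explicit offset because older Python versions do not accept it.
    moment = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # utctimetuple() shifts an offset-aware value to UTC; a value without an
    # offset is taken as UTC already. timegm() reads the tuple as UTC.
    return calendar.timegm(moment.utctimetuple())


if __name__ == "__main__":
    # Hypothetical sample value, not taken from the site.
    print(to_unix_timestamp("2015-01-01T00:00:00Z"))
```

If the resulting number does not match what the browser sends when you click "Load More", compare both values in the network tab before relying on the conversion.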

Notes:

This requires the dateutil module (the python-dateutil package) to be installed.
Recheck the code and make sure you are getting all of the reviews without skipping any of them.
You still need to add a condition that ends the "Load more" loop.
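
One possible way to end the loop, as a rough sketch: stop when a response contains no reviews, or when the last timestamp has already been requested. The helper next_page_url and the seen_timestamps set are hypothetical names, not part of Scrapy. In parse you would compute the timestamp only if reviews is non-empty (passing None otherwise) and yield a Request only when the helper returns a URL.

```python
BASE_URL = "http://www.t3.com/more/reviews/latest/"


def next_page_url(last_timestamp, seen_timestamps, base=BASE_URL):
    """Return the next "Load more" URL, or None when pagination should stop."""
    # None means the response contained no reviews, so there is nothing to load.
    if last_timestamp is None:
        return None
    # A repeated timestamp means the site returned the same batch again.
    if last_timestamp in seen_timestamps:
        return None
    seen_timestamps.add(last_timestamp)
    return "%s%d" % (base, last_timestamp)


if __name__ == "__main__":
    seen = set()
    print(next_page_url(1420070400, seen))  # first time: a URL
    print(next_page_url(1420070400, seen))  # repeated: None
```

Keeping the set on the spider instance (for example, created in __init__) lets every call to parse share it.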
