The

Reputation: 49

How can I jump to next page in Scrapy

I'm trying to scrape the results from here using scrapy. The problem is that not all of the classes appear on the page until the 'load more results' tab is clicked.

The problem can be seen in this screenshot:

[Screenshot of the problem]

My code looks like this:

class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])

Upvotes: 0

Views: 2264

Answers (1)

Granitosaurus

Reputation: 21436

The second page of this website seems to be generated via an AJAX call. If you look at the network tab of any browser inspection tool, you'll see something like:

[Screenshot: Firebug network tab]

In this case it seems to be retrieving a JSON file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

The URL parameter _=1469471093134 does not seem to do anything, so you can trim it away and request https://www.class-central.com/maestro/courses/recentlyAdded?page=2 instead.
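
If you would rather strip that parameter in code than by hand, here is a minimal sketch using only the standard library. The helper name `drop_query_param` is made up for this example; it is not part of Scrapy or w3lib.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return `url` with every occurrence of query parameter `name` removed.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # keep every (key, value) pair except the one we want to drop
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


ajax_url = ("https://www.class-central.com/maestro/courses/recentlyAdded"
            "?page=2&_=1469471093134")
clean_url = drop_query_param(ajax_url, "_")
print(clean_url)
```
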
The returned JSON contains the HTML code for the next page:

# so you just need to load it with
data = json.loads(response.body)
# and convert it to a Scrapy selector
sel = Selector(text=data['table'])
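
To see the idea without running a spider, here is a rough standalone sketch. It swaps Scrapy's Selector for the standard library's `html.parser` so it runs anywhere, and the sample payload is invented for the example; only the `'table'` key and the `course-name-text` class come from the real page. It also assumes the name span contains no nested spans.

```python
import json
from html.parser import HTMLParser


class CourseNameParser(HTMLParser):
    """Collect the text of every <span class="course-name-text"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "course-name-text":
            self._in_name = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.names[-1] += data


def course_names_from_json(body):
    """Decode the JSON body, then parse the HTML stored under 'table'."""
    data = json.loads(body)
    parser = CourseNameParser()
    parser.feed(data["table"])
    parser.close()
    return [name.strip() for name in parser.names]


# made-up payload shaped like the AJAX response (HTML wrapped in JSON)
sample_body = json.dumps({
    "table": '<tr><td><span class="course-name-text">Course A</span></td></tr>'
             '<tr><td><span class="course-name-text">Course B</span></td></tr>'
})
names = course_names_from_json(sample_body)
print(names)
```

In the spider itself the Selector version is shorter, since XPath does the filtering for you.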

To replicate this in your code, try something like:

import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter

# JSON endpoint behind the 'load more' button (URL seen in the network tab)
AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'

def parse(self, response):
    # check if response is json, if so convert to selector
    is_json = response.meta.get('is_json', False)
    if is_json:
        # the HTML for the page is stored under the 'table' key
        sel = Selector(text=json.loads(response.text)['table'])
    else:
        sel = Selector(response)
    # parse page here for items
    names = sel.xpath('//span[@class="course-name-text"]/text()').extract()
    for name in names:
        item = ClasscentralItem()
        item['name'] = name
        yield item
    # do next page: the HTML page has the 'show more' div; the JSON fragment
    # may not include it, so for JSON pages continue while courses come back
    has_next = sel.xpath("//div[@id='show-more-courses']") or (is_json and names)
    if has_next:
        next_page = response.meta.get('page', 1) + 1
        # make next page url
        url = add_or_replace_parameter(AJAX_URL, 'page', str(next_page))
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
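
As an alternative to carrying the page number in meta, you could derive the next URL from the current one. This is only a sketch with the standard library; `next_page_url` is a hypothetical helper, and it assumes a URL without a `page` parameter counts as page 1.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="page"):
    """Return `url` with its page parameter incremented by one.

    Hypothetical helper: a missing parameter is treated as page 1.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    current = int(query.get(param, 1))
    query[param] = str(current + 1)
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://www.class-central.com/maestro/courses/recentlyAdded"
second = next_page_url(base)
third = next_page_url(second)
print(second)
print(third)
```

Inside the spider you would call it on the URL of the JSON request you just made, not on the URL of the first HTML page, since the two live at different paths.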

Upvotes: 1
