Reputation: 1389
I am crawling a site which uses JavaScript to go to the next page. I am using Splash to execute my JavaScript code on the first page, and I was able to get to the 2nd page. But I am unable to reach the 3rd, 4th, 5th... pages; crawling stops after only one page.
the link I am crawling: http://59.180.234.21:8788/user/viewallrecord.aspx
The code:
import scrapy
from scrapy_splash import SplashRequest


class MSEDCLSpider(scrapy.Spider):
    name = "msedcl_spider"
    scope_path = 'body > table:nth-child(11) tr > td.content_area > table:nth-child(4) tr:not(:first-child)'
    ref_no_path = "td:nth-child(1) ::text"
    title_path = "td:nth-child(2) ::text"
    end_date_path = "td:nth-child(5) ::text"
    fee_path = "td:nth-child(6) ::text"
    start_urls = ["http://59.180.234.21:8788/user/viewallrecord.aspx"]
    lua_src = """function main(splash)
        local url = splash.args.url
        splash:go(url)
        splash:wait(2.0)
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4.0)
        return {
            splash:html(),
        }
    end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                method='POST',
                dont_filter=True,
                args={
                    'wait': 1.0,
                    'lua_source': self.lua_src,
                },
            )

    def parse(self, response):
        print(response.status)
        scopes = response.css('#page-info').extract()[0]
        print(response.url)
        print(scopes)
I am a newbie to both Scrapy and Splash, so please be gentle. Thank you.
Upvotes: 3
Views: 1342
Reputation: 22238
I can see two issues:
1. You are not making these requests. In start_requests a single request is issued and its response is parsed in the self.parse method, but requests to the 3rd and later pages are never sent. To do that you need to yield further requests from your .parse method (see the sketch after this list).
2. If you fix (1), you'll likely face the next issue: Splash doesn't keep page state between requests. Think of each request as opening a new private-mode browser window and doing some actions; this is by design. But the problem with this website is that the URL doesn't change between pages, so you can't just start from e.g. the 3rd page and click "next".
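To make (1) concrete, here is a minimal, untested sketch of re-issuing a request from parse. The page_number Splash argument and the 100-page cap are assumptions for illustration; because of (2), the Lua script would also need to be changed to click "next" page_number - 1 times after a fresh splash:go.

    # Hypothetical sketch: yield a follow-up SplashRequest from parse().
    # Assumes lua_src is rewritten to click "next" page_number - 1 times.
    def parse(self, response):
        # ... extract items from the current page here ...
        page_number = response.meta.get('page_number', 1) + 1
        if page_number <= 100:  # assumed upper bound
            yield SplashRequest(
                self.start_urls[0],
                self.parse,
                endpoint='execute',
                dont_filter=True,
                args={'lua_source': self.lua_src, 'page_number': page_number},
                meta={'page_number': page_number},
            )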
But I think there are ways to work around (2). Maybe you can get the page HTML after clicking and then load it into the browser using splash:set_content; you can also preserve cookies - there is an example in the scrapy-splash README - though this website doesn't seem to rely on cookies for pagination.
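A rough, untested sketch of that splash:set_content idea, keeping the Lua inside a Python string as in the question. The html argument (HTML saved from an earlier response) and the baseurl value are assumptions, and an ASP.NET postback may still fail if the saved form state has expired:

    # Hypothetical: resume pagination from HTML captured in a previous response.
    lua_resume_src = """function main(splash)
        splash:set_content{
            data = splash.args.html,  -- HTML string passed in via SplashRequest args
            baseurl = "http://59.180.234.21:8788/user/",  -- so the "next" postback hits the real site
        }
        splash:wait(1.0)
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4.0)
        return {html = splash:html()}
    end
    """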
Another way is to write a script which loads all the pages, not only the next one, and then returns the content of all pages to the client. Something like this (untested):
function main(splash)
    splash:go(splash.args.url)
    local pages = {splash:html()}  -- page 1
    for i = 2, 100 do
        splash:runjs("document.querySelectorAll('#lnkNext')[0].click()")
        splash:wait(4)
        pages[i] = splash:html()   -- pages 2..100
    end
    return pages
end
For this to work you will need a much larger timeout value; you may also have to start Splash with a larger --max-timeout option.
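On the Scrapy side, a sketch of how such a multi-page response could be consumed. It assumes scrapy-splash's JSON response, whose response.data attribute holds the decoded value returned by the Lua script (here a list of HTML strings), plus an illustrative timeout value:

    import scrapy
    from scrapy import Selector
    from scrapy_splash import SplashRequest

    class MSEDCLAllPagesSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = "msedcl_all_pages"
        lua_src = "..."  # the all-pages Lua script above

        def start_requests(self):
            yield SplashRequest(
                "http://59.180.234.21:8788/user/viewallrecord.aspx",
                self.parse,
                endpoint='execute',
                args={
                    'lua_source': self.lua_src,
                    'timeout': 600,  # assumed value; must stay below Splash's --max-timeout
                },
            )

        def parse(self, response):
            # The 'execute' endpoint returns JSON when the Lua script returns a table;
            # scrapy-splash exposes it decoded as response.data (a list of HTML strings here).
            for html in response.data:
                page = Selector(text=html)
                yield {'page_info': page.css('#page-info').extract_first()}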
Upvotes: 4