somedude

Reputation: 91

Using Scrapy-splash to navigate dynamic pages

Background: I am using Scrapy to crawl and scrape product data from http://shop.nordstrom.com/c/mens-tshirts. The page is dynamically generated, so I use Scrapy-Splash to handle the JavaScript. The problem is that clicking the "Next" button at the bottom of the page is the only way to reach the next page of products. If you copy the URL of page 2 and paste it into a new tab, the page has no products on it.

To work around this, I am trying to use Selenium's .click() to navigate to the next page, and driver.page_source to extract the HTML of the page.

Question: Is there a way to pass the HTML/JavaScript source that I extract into Splash (running inside a Docker container), rather than passing in a URL? I've tried saving the HTML on my local machine and passing the file path, but that results in a 502 Bad Gateway because Splash automatically prepends 'http://' to the path.

If there's a better way to achieve this, I'm open to it. Please keep in mind that the solution must scale and be suitable for cloud deployment. Thanks!

Upvotes: 1

Views: 1768

Answers (1)

Mikhail Korobov

Reputation: 22238

You can write a Splash Lua script that calls splash:set_content instead of accepting a URL, something like this:

function main(splash, args)
    assert(splash:set_content(args.html_source))
    -- the page is now loaded; process it as needed, e.g. return the rendered HTML
    return splash:html()
end
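To show how the HTML could reach that script, here is a minimal Python sketch that POSTs it to Splash's `execute` endpoint, where extra JSON fields show up in `args`. The endpoint address (`http://localhost:8050/execute`) and the helper names `build_payload` and `render_html` are assumptions for illustration, not part of the answer above; only the standard library is used.

```python
import json
import urllib.request

# Assumption: Splash is reachable on the default port of its Docker image.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"

# Same idea as the Lua script above: load the HTML we were given, not a URL.
LUA_SET_CONTENT = """
function main(splash, args)
    assert(splash:set_content(args.html_source))
    return splash:html()
end
"""


def build_payload(html_source: str) -> bytes:
    """Build the JSON body; fields other than lua_source become args.<name> in Lua."""
    payload = {"lua_source": LUA_SET_CONTENT, "html_source": html_source}
    return json.dumps(payload).encode("utf-8")


def render_html(html_source: str,
                endpoint: str = SPLASH_EXECUTE_URL,
                timeout: float = 60.0) -> str:
    """Send the HTML to Splash and return the response body (hypothetical helper)."""
    request = urllib.request.Request(
        endpoint,
        data=build_payload(html_source),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With Selenium, the call would look roughly like `render_html(driver.page_source)`. Inside a Scrapy project, the same two fields (`lua_source` and `html_source`) could instead be passed through the `args` of a scrapy-splash request to the `execute` endpoint.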

You can also click the button in Splash itself (see element:mouse_click), something like this:

function main(splash, args)
    assert(splash:go(args.url))
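    -- splash:select returns nil when nothing matches the selector
    local next_button = assert(splash:select('.next'), 'no .next element found')
    next_button:mouse_click()
    -- give the page time to load the next batch of products
    assert(splash:wait(5.0))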
    return splash:html()
end
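To walk through several pages in one Splash call, the click can be repeated in a loop. The sketch below builds the JSON payload for the `execute` endpoint in Python; the `.next` selector is the placeholder from the script above and would need checking against the real page, and `build_click_payload`, `extract_pages`, `max_pages`, `next_selector` and `wait` are hypothetical names chosen for illustration. A fixed wait is crude and may need tuning.

```python
import json

# Lua script: load the first page, then click "Next" up to max_pages - 1 times,
# collecting the rendered HTML after each click.
LUA_CLICK_THROUGH = """
local treat = require("treat")

function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    local pages = {splash:html()}
    for _ = 2, args.max_pages do
        local next_button = splash:select(args.next_selector)
        if not next_button then break end  -- no more pages
        next_button:mouse_click()
        assert(splash:wait(args.wait))
        pages[#pages + 1] = splash:html()
    end
    -- treat.as_array makes Splash serialize the table as a JSON array
    return {pages = treat.as_array(pages)}
end
"""


def build_click_payload(url: str,
                        next_selector: str = ".next",
                        max_pages: int = 3,
                        wait: float = 5.0) -> dict:
    """Build the JSON payload; every key except lua_source is read via args in Lua."""
    return {
        "lua_source": LUA_CLICK_THROUGH,
        "url": url,
        "next_selector": next_selector,
        "max_pages": max_pages,
        "wait": wait,
    }


def extract_pages(response_body: str) -> list:
    """Pull the list of HTML strings out of the JSON that the script returns."""
    return json.loads(response_body)["pages"]
```

The payload would be serialized with `json.dumps` and POSTed to Splash's `execute` endpoint, and each string returned by `extract_pages` can then be parsed like any other page source.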

Check the tutorial and the Lua API overview for more. You can interact with the page much as you would in Selenium; not all Selenium helpers are available, but the basics are there.

Upvotes: 1
