Reputation: 91
Background: I am using Scrapy to crawl and scrape product data from http://shop.nordstrom.com/c/mens-tshirts. The page is dynamically generated, so I use Scrapy-Splash to deal with the JavaScript. The problem is that clicking the "Next" button at the bottom of the page is the only way to get to the subsequent product page. If you copy the URL of page 2 and paste it into a new tab, the page has no products on it.
To work around this, I am trying to use Selenium's .click() method to navigate to the next page and driver.page_source to extract the HTML of that page.
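For context, here is a minimal sketch of that Selenium step (assuming Chrome/chromedriver; the CSS selector for the "Next" button is a guess and the fixed sleep is just a placeholder for a proper wait):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://shop.nordstrom.com/c/mens-tshirts")

# Click the "Next" button (selector is a placeholder) and give page 2 time to render.
driver.find_element(By.CSS_SELECTOR, "a.next").click()
time.sleep(5)

# Rendered HTML of page 2, which I want to hand off to Splash.
html_source = driver.page_source
driver.quit()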
Question: Is there a way to pass the HTML/JavaScript source that I extract into Splash (running inside a Docker container), rather than passing in a URL? I've tried saving the HTML on my local machine and passing the file path, but that results in a 502 Bad Gateway because Splash automatically prepends 'http://' to the path.
Maybe there's a better method for achieving my goal here; if so, I'm open to any options. Please keep in mind that the solution must be appropriate for scalability and cloud deployment. Thanks!
Upvotes: 1
Views: 1768
Reputation: 22238
You can write a Splash Lua script which calls splash:set_content instead of accepting a URL, something like this:
function main(splash, args)
  assert(splash:set_content(args.html_source))
  -- page is loaded, process it as needed
end
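For what it's worth, here is a rough sketch of how that script could be called from a Scrapy spider through scrapy-splash's /execute endpoint, passing the HTML you extracted as an argument (this assumes scrapy-splash is already configured with SPLASH_URL and its middlewares; the placeholder HTML and the parse callback are only illustrative):

import scrapy
from scrapy_splash import SplashRequest

LUA_SET_CONTENT = """
function main(splash, args)
  assert(splash:set_content(args.html_source))
  return splash:html()
end
"""

class TshirtsSpider(scrapy.Spider):
    name = "tshirts"

    def start_requests(self):
        # html_source is whatever driver.page_source gave you; a placeholder here.
        html_source = "<html><body>...</body></html>"
        yield SplashRequest(
            "http://shop.nordstrom.com/c/mens-tshirts",
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": LUA_SET_CONTENT, "html_source": html_source},
        )

    def parse(self, response):
        # response.text is the HTML that Splash rendered from html_source.
        self.logger.info("Got %d bytes of rendered HTML", len(response.text))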
You can also click on a button in Splash itself - see element:mouse_click, something like this:
function main(splash, args)
  assert(splash:go(args.url))
  splash:select('.next'):mouse_click()
  splash:wait(5.0)
  return splash:html()
end
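If you go this route you don't need Selenium at all, which also fits the scalability requirement: the whole pagination step runs inside the Splash container. As a rough sketch, the same script can be sent directly to Splash's /execute HTTP API (the http://localhost:8050 address assumes a default local Docker setup, and '.next' is still just a guess at the button's selector):

import requests

lua_source = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:select('.next'):mouse_click()
  splash:wait(5.0)
  return splash:html()
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": lua_source, "url": "http://shop.nordstrom.com/c/mens-tshirts"},
)
resp.raise_for_status()
page_two_html = resp.text  # HTML returned by the script's return splash:html()

In a Scrapy project you would normally send the same lua_source via SplashRequest(..., endpoint='execute') instead of a raw requests call.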
Check the tutorial and Lua API overview for more. You can interact with the page much like in Selenium; not all Selenium helpers are available, but the basics are there.
Upvotes: 1