Reputation: 5931
I am trying to render and scrape an interactive website by invoking Splash from a Python script, basically following this tutorial:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        filename = 'mywebsite-%s.html' % '1'
        with open(filename, 'wb') as f:
            f.write(response.body)
The output looks fine; however, it is missing the part of the website that is loaded through AJAX after a second or two, which is the content I actually need. Now the weird part: if I access Splash directly inside the container through its web interface, set the same URL, and hit the Render button, the returned response is correct. So the only question is, why doesn't the website render correctly when it's invoked from the Python script?
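For comparison, the UI's Render button just issues an HTTP request against Splash's render.html endpoint, and the same request can be reproduced from Python to rule the script out. A minimal sketch, assuming Splash is listening on its default port 8050 (adjust the address to your container):

```python
from urllib.parse import urlencode

# Hypothetical Splash instance address; adjust host/port to your container.
SPLASH = "http://localhost:8050"

def render_url(url, wait=3.0):
    """Build the same render.html request the Splash web UI issues."""
    return "%s/render.html?%s" % (SPLASH, urlencode({'url': url, 'wait': wait}))

print(render_url("http://example.com"))
```

Fetching that URL (e.g. with `urllib.request.urlopen`) while Splash is running should return the same HTML the web interface shows, which makes it easier to tell whether the problem is the `wait` value or the way the spider passes its arguments.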
Upvotes: 5
Views: 2581
Reputation: 5931
I tried what adrihanu suggested, but it didn't work. After a while I started wondering whether it is possible to execute the very same script that the Splash UI executes. It turns out you can pass the Lua script as an argument, and it works!
script1 = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            endpoint='execute',
                            args={
                                'html': 1,
                                'lua_source': self.script1,
                                'wait': 0.5,
                            })
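One thing to keep in mind with the execute endpoint: Splash serializes the Lua table returned by `main` as JSON, and scrapy-splash exposes the decoded object as `response.data`, so the rendered page lives in `response.data['html']` rather than directly in `response.body`. A stdlib-only sketch of that unwrapping, using a made-up sample payload in place of a real Splash response:

```python
import json

# Made-up stand-in for the JSON body the execute endpoint returns when the
# Lua script returns a table with html/png/har fields.
sample_body = '{"html": "<html><body>loaded via ajax</body></html>", "har": {}}'

def extract_html(body):
    # scrapy-splash performs this decoding for you (response.data);
    # shown explicitly here for clarity.
    return json.loads(body)["html"]

print(extract_html(sample_body))
```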
Upvotes: 1