Splash does not parse JS before returning HTML response

Question

In my crawler, which consists of Scrapy and a Splash server, I am having problems with this site: https://www.lavoropiu.it/offerte

The problem is that Splash downloads the site's HTML without executing its JavaScript. The site is an Angular app.

I have tried different Splash settings:

splash.private_mode_enabled = false

splash.js_enabled = true

The returned HTML is this:



Lavoropiu

As you can see, Splash does not execute the scripts on the page before returning the HTML. Is this an issue with Splash, or am I missing some setting?

Thanks for your help.

Simba · Accepted Answer

Splash fails to load the JavaScript. This is a common problem when scraping with Splash. Search the Splash issue tracker; there are many reports about JavaScript loading problems.

The default engine for Splash is WebKit. It behaves differently from the engines in common web browsers such as Chrome and Firefox. For web scraping, you are better off using headless Chrome to download pages that rely on JavaScript.

For async integration with Scrapy, try Playwright or Puppeteer. The former has a Scrapy plugin, scrapy-playwright, which is currently maintained.
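
To make the scrapy-playwright suggestion a bit more concrete, here is a minimal sketch of the configuration it needs. The setting names and the "playwright" meta key are the ones I remember from the scrapy-playwright README; the dict name PLAYWRIGHT_SETTINGS and the helper playwright_request_kwargs are made up for illustration, so treat this as a starting point rather than a tested setup for this site.

```python
# Settings scrapy-playwright expects; put them in settings.py or assign the
# dict to a spider's `custom_settings`. PLAYWRIGHT_SETTINGS is a hypothetical name.
PLAYWRIGHT_SETTINGS = {
    # Route http/https downloads through the Playwright download handler.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    # Use Chromium instead of Splash's WebKit engine.
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
}


def playwright_request_kwargs(url):
    """Build keyword arguments for scrapy.Request (hypothetical helper).

    Setting meta["playwright"] to True asks scrapy-playwright to render
    the page in a real browser before the response reaches the callback.
    """
    return {"url": url, "meta": {"playwright": True}}


# Inside a spider's start_requests (assumed usage):
#     yield scrapy.Request(**playwright_request_kwargs("https://www.lavoropiu.it/offerte"))
```

For an Angular app you will probably also need to wait for some element to appear before reading the HTML; the plugin has page-method hooks for that, which I have left out here.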

Update: The Splash HTTP API endpoint render.html supports switching the engine to "chromium", but this is experimental. You can give it a try.
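
A rough sketch of trying the experimental engine through render.html, using only the standard library. It assumes a Splash instance listening on http://localhost:8050 (the usual default) that is recent enough to accept the engine argument; build_render_url and fetch_rendered_html are hypothetical helper names, not part of Splash.

```python
import urllib.parse
import urllib.request


def build_render_url(target_url, splash_base="http://localhost:8050", engine="chromium"):
    """Build a render.html URL asking Splash to use the given engine.

    `splash_base` is an assumption; point it at your own Splash server.
    """
    query = urllib.parse.urlencode({"url": target_url, "engine": engine})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(target_url, timeout=60):
    """Request the rendered page from Splash and return it as text."""
    with urllib.request.urlopen(build_render_url(target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Assumed usage, with a Splash server running locally:
#     html = fetch_rendered_html("https://www.lavoropiu.it/offerte")
```

Since Chromium support is experimental, other render.html arguments may not all work with it, which is why the sketch passes only url and engine.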
