Alberto Favaro

Reputation: 1842

Splash does not execute JS before returning the HTML response

In my crawler, which consists of Scrapy and a Splash server, I am having problems with this site: https://www.lavoropiu.it/offerte

The problem is that Splash returns the site's HTML without executing its JavaScript. The site is an Angular app.

I have tried different Splash settings:

splash.private_mode_enabled = false

splash.js_enabled = true
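
A minimal sketch of how settings like these are typically passed to Splash: as a Lua script sent to the /execute endpoint, here assembled in Python. The `splash:wait` call, the 2-second default and the helper name `build_execute_payload` are assumptions added for illustration and are not part of the setup described above. A wait only gives the page time to bootstrap; it will not help if the engine cannot run the site's bundles at all.

```python
import json

# Lua script for Splash's /execute endpoint. The two property assignments are
# the settings from the question; the splash:wait call is an added assumption.
LUA_SOURCE = """
function main(splash, args)
  splash.private_mode_enabled = false
  splash.js_enabled = true
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return splash:html()
end
"""


def build_execute_payload(url: str, wait: float = 2.0) -> str:
    """Return the JSON body to POST to <splash-host>/execute (hypothetical helper)."""
    # Extra JSON keys such as "url" and "wait" are exposed to the script as args.*
    return json.dumps({"lua_source": LUA_SOURCE, "url": url, "wait": wait})


payload = build_execute_payload("https://www.lavoropiu.it/offerte")
```

The resulting body would be POSTed with Content-Type: application/json to the Splash instance, which listens on port 8050 by default.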

The returned HTML is this:

<!DOCTYPE html><html lang="en"><head>
<meta charset="utf-8">
<title>Lavoropiu</title>
<base href="/">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" type="image/x-icon" href="favicon.ico">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/chphsalvo/[email protected]/dist/css/style.min.css">

<!-- Global site tag (gtag.js) - Google Analytics -->
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-173597693-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'UA-173597693-1', { send_page_view: false});
</script>

<link rel="stylesheet" href="styles.66ab468982a30141059e.css">
</head>
<body>

<script src="runtime.d6c52737d4587c65265f.js" defer=""></script>
<script src="polyfills.f782e0cdb7e1242a13e4.js" defer=""></script>
<script src="vendor.82696fd86eeed5072685.js" defer=""></script>
<script src="main.076dbf684e565ed2798b.js" defer=""></script>

<app-root></app-root>

</body>
</html>

As you can see, Splash does not execute the scripts on the page before returning the HTML. Is this an issue with Splash, or am I missing some setting?
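
The symptom in the HTML above is an empty `<app-root>` element. A rough sketch of an automated check for that, using only the Python standard library, might look like the following. The names `RootProbe` and `is_rendered` are hypothetical, and treating "any child tag or non-blank text inside the root element" as "rendered" is a simplifying assumption.

```python
from html.parser import HTMLParser


class RootProbe(HTMLParser):
    """Records whether anything is nested inside the given root element."""

    def __init__(self, root_tag: str):
        super().__init__()
        self.root_tag = root_tag
        self.inside = False       # currently between <root_tag> and </root_tag>
        self.has_content = False  # saw a child tag or non-blank text inside

    def handle_starttag(self, tag, attrs):
        if tag == self.root_tag:
            self.inside = True
        elif self.inside:
            self.has_content = True

    def handle_endtag(self, tag):
        if tag == self.root_tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.has_content = True


def is_rendered(html: str, root_tag: str = "app-root") -> bool:
    """Return True if the root element contains any markup or text."""
    probe = RootProbe(root_tag)
    probe.feed(html)
    probe.close()
    return probe.has_content
```

Calling `is_rendered(response_text)` on a response like the one above would return False, which a spider could use to log or retry pages that came back unrendered.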

Thanks for your help.

Upvotes: 0

Views: 692

Answers (1)

Simba

Reputation: 27588

Splash fails to load the JavaScript. This is a common problem when scraping with Splash: search the Splash issue tracker and you will find many reports of JavaScript loading problems.

The default engine for Splash is WebKit. It behaves differently from the engines in common browsers such as Chrome and Firefox. For web scraping, you are better off using headless Chrome to download pages that rely on JavaScript.

For async integration with Scrapy, try Playwright or Puppeteer. The former has a Scrapy plugin, scrapy-playwright, which is currently maintained.
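
As a rough sketch of what the scrapy-playwright route involves: the plugin is enabled through a few Scrapy settings, and individual requests opt in through their meta dict. The setting names below follow the scrapy-playwright README; the helper `playwright_request_kwargs` is a hypothetical name used only for illustration, and the snippet deliberately avoids importing Scrapy so it stays self-contained.

```python
# Settings documented by scrapy-playwright: route HTTP(S) downloads through
# its handler and run Twisted on the asyncio reactor.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
}


def playwright_request_kwargs(url: str) -> dict:
    """Keyword arguments for scrapy.Request (hypothetical helper)."""
    # meta["playwright"] = True asks the plugin to fetch this URL in a browser.
    return {"url": url, "meta": {"playwright": True}}


request_kwargs = playwright_request_kwargs("https://www.lavoropiu.it/offerte")
```

In a spider this would be used as `custom_settings = PLAYWRIGHT_SETTINGS` and `yield scrapy.Request(**request_kwargs)`, after installing scrapy-playwright and running `playwright install chromium`.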

Update: the Splash HTTP API endpoint render.html supports switching the engine to "chromium", but this is experimental. You can give it a try.
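
A small sketch of what such a request could look like, assuming a Splash instance on its default port at http://localhost:8050 and a Splash version that accepts the `engine` argument. The helper name `build_render_url` and the 2-second wait are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_render_url(target: str, splash_base: str = "http://localhost:8050",
                     wait: float = 2.0, engine: str = "chromium") -> str:
    """Build a render.html URL asking Splash to render `target` (hypothetical helper)."""
    # url, wait and engine are query arguments of Splash's render.html endpoint.
    query = urlencode({"url": target, "wait": wait, "engine": engine})
    return f"{splash_base}/render.html?{query}"


render_url = build_render_url("https://www.lavoropiu.it/offerte")
```

Fetching `render_url` with any HTTP client returns the rendered HTML if the engine manages to run the page's scripts.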

Upvotes: 1
