Reputation: 1842
In my crawler, which consists of Scrapy and a Splash server, I am having problems with this site: https://www.lavoropiu.it/offerte
The problem is that Splash returns the site's HTML without executing its JavaScript. The site is an Angular app.
I have tried different Splash settings:
splash.private_mode_enabled = false
splash.js_enabled = true
The returned HTML is this:
<!DOCTYPE html><html lang="en"><head>
<meta charset="utf-8">
<title>Lavoropiu</title>
<base href="/">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" type="image/x-icon" href="favicon.ico">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/chphsalvo/[email protected]/dist/css/style.min.css">
<!-- Global site tag (gtag.js) - Google Analytics -->
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-173597693-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-173597693-1', { send_page_view: false});
</script>
<link rel="stylesheet" href="styles.66ab468982a30141059e.css">
</head>
<body>
<script src="runtime.d6c52737d4587c65265f.js" defer=""></script>
<script src="polyfills.f782e0cdb7e1242a13e4.js" defer=""></script>
<script src="vendor.82696fd86eeed5072685.js" defer=""></script>
<script src="main.076dbf684e565ed2798b.js" defer=""></script>
<app-root></app-root>
</body>
</html>
As you can see, Splash does not execute the scripts on the page before returning the HTML. Is this an issue with Splash, or am I missing some setting?
Thanks for your help.
Upvotes: 0
Views: 692
Reputation: 27588
Splash fails to load the JavaScript. This is a common problem when scraping with Splash. Search the Splash issue tracker; there are many reports about JavaScript loading problems.
The default engine for Splash is WebKit, which behaves differently from the engines in common browsers such as Chrome and Firefox. For web scraping, you are better off using headless Chrome to download pages that rely on JavaScript.
For async integration with Scrapy, try Playwright or Puppeteer. The former has a Scrapy plugin, scrapy-playwright, which is currently maintained.
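To make the scrapy-playwright suggestion concrete, here is a minimal sketch of the settings that plugin expects in a project's settings.py. The `PLAYWRIGHT_META` name is just a hypothetical convenience constant for this example, not something the plugin defines.

```python
# settings.py fragment (sketch): route Scrapy downloads through scrapy-playwright.
# Assumes `pip install scrapy-playwright` and `playwright install chromium` were run.

# Let the Playwright download handler serve both http and https requests.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Hypothetical helper constant: pass it as `meta` on each request that should
# be rendered in the browser, e.g. scrapy.Request(url, meta=PLAYWRIGHT_META).
PLAYWRIGHT_META = {"playwright": True}
```

Requests without the `playwright` meta key still go through Scrapy's normal downloader, so only the Angular pages need to pay the cost of a real browser.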
Update: the Splash HTTP API endpoint render.html supports switching the engine to "chromium", but this is experimental. You can give it a try.
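If you want to try the engine switch quickly, a rough sketch of building the render.html request URL is below. The Splash address `http://localhost:8050`, the `wait` value, and the helper name `build_render_url` are assumptions for illustration; whether the Chromium engine is available depends on your Splash version.

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=2.0, engine="chromium", splash_url=SPLASH_URL):
    """Build a Splash render.html URL that requests the given browser engine.

    `wait` gives the page some seconds to run its scripts before the HTML
    snapshot is taken; the value here is a guess, tune it for the site.
    """
    params = {"url": page_url, "wait": wait, "engine": engine}
    return f"{splash_url}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    # Fetch this URL with curl or a browser to see what Splash returns.
    print(build_render_url("https://www.lavoropiu.it/offerte"))
```

With scrapy-splash, the same parameters can presumably go in the request's `args` dict (for example `args={"engine": "chromium", "wait": 2}` with `endpoint="render.html"`), since those arguments are forwarded to the Splash endpoint.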
Upvotes: 1