Clark

Reputation: 15

Scrapy fetching incomplete HTML

I'm a retired programmer but new to Scrapy. Actually, this is my first Python project, so I could be doing almost anything wrong.

I set up Scrapy under Anaconda and started a shell with:

 scrapy shell "https://sailing-channels.com/by-subscribers"

It looks like everything is working fine, and I can get some queries to work.

Here is my problem: when I enter:

response.css('body').extract()

I get:['<body><noscript>If you\'re seeing this message, that means <strong>JavaScript has been disabled on your browser</strong>, please <strong>enable JS</strong> to make this app work.</noscript><div id="app"></div><script src="//apis.google.com/js/platform.js" async></script><script>!function(e,a,n,t,g,c,i){e.GoogleAnalyticsObject="ga",e.ga=e.ga||function(){(e.ga.q=e.ga.q||[]).push(arguments)},e.ga.l=1*new Date,c=a.createElement(n),i=a.getElementsByTagName(n)[0],c.async=1,c.src="//www.google-analytics.com/analytics.js",i.parentNode.insertBefore(c,i)}(window,document,"script"),ga("create","UA-15981085-17","auto"),ga("require","linkid"),ga("set","anonymizeIp",!0),ga("send","pageview")</script><script type="application/ld+json">{\n\t\t\t"@context": "http://schema.org",\n\t\t\t"@type": "Organization",\n\t\t\t"name": "Sailing Channels"\n\t\t\t"url": "https://www.sailing-channels.com",\n\t\t\t"logo": "https://sailing-channels.com/img/banner.png",\n\t\t\t"sameAs" : [\n\t\t\t\t"https://www.facebook.com/sailingchannels",\n\t\t\t\t"https://twitter.com/sailchannels"\n\t\t\t]\n\t }</script><script type="text/javascript" src="https://cdn.sailing-channels.com/1.15.9/main.1dad65fcb7a507930e1f.js"></script></body>']

I expected a lot more. When I inspect the page in Chrome, I see many more div elements inside <div id="app"></div>.

Could someone shed some light on what I'm doing wrong? I want to scrape the channel name, subscriber count, and views.

Thanks

Upvotes: 1

Views: 303

Answers (1)

Pankaj

Reputation: 939

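Understandable. This happens because the site renders its data with a script while the page loads. The HTML that Scrapy receives contains only the empty <div id="app"></div>; the page's JavaScript fills it in afterwards, in the browser.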
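A quick way to confirm this from the shell is to check whether the mount point is empty in the raw response. The following is a minimal sketch using only the standard library; the helper names `MountPointProbe` and `is_empty_shell` are made up for illustration, and the id `app` is taken from the output in the question.

```python
from html.parser import HTMLParser


class MountPointProbe(HTMLParser):
    """Count element and text nodes found inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False     # True once the target element has been seen
        self.depth = 0         # 0 means we are outside the target element
        self.children = 0      # element nodes seen inside the target
        self.text_chars = 0    # non-whitespace characters seen inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.children += 1
            if tag not in self.VOID:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text_chars += len(data.strip())


def is_empty_shell(html, target_id="app"):
    """Return True if the element exists but has no child elements or text."""
    probe = MountPointProbe(target_id)
    probe.feed(html)
    return probe.found and probe.children == 0 and probe.text_chars == 0
```

In the Scrapy shell, `is_empty_shell(response.text)` returning True suggests the content you see in Chrome is added by JavaScript after the initial download.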

With a default Scrapy setup, dynamically loaded content doesn't appear, because Scrapy downloads the raw HTML and does not execute JavaScript. To scrape that data you can use Selenium:
selenium-with-scrapy-for-dynamic-page
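As a rough sketch of the Selenium route (not taken from the linked post), the idea is to let a real browser run the page's JavaScript and then hand the rendered HTML to Scrapy's selectors. This assumes Selenium 4 and a local Chrome install; the names `make_headless_chrome` and `fetch_rendered_html`, the `#app > *` readiness selector, and the timeout values are my own choices for illustration.

```python
import time


def make_headless_chrome():
    """Start headless Chrome. Assumes selenium 4 and Chrome are installed."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_chrome,
                        ready_css="#app > *", timeout=15.0, poll=0.5):
    """Load url in a browser and return the HTML after JavaScript has run.

    Polls until at least one element matches ready_css (an assumed sign
    that the app has rendered), or raises TimeoutError after `timeout` s.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        while not driver.find_elements("css selector", ready_css):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{ready_css!r} never appeared on {url}")
            time.sleep(poll)
        return driver.page_source
    finally:
        # Always close the browser, even if the wait timed out.
        driver.quit()
```

The returned string can then be wrapped with `scrapy.selector.Selector(text=html)` and queried with `.css()` as usual. The selectors for channel name, subscriber count, and views are not shown here because they depend on the rendered markup; inspect the page in Chrome to find them.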

Alternatively, you can use Splash to handle JavaScript-rendered content:
handling-javascript-in-scrapy-with-splash
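Since you are working in the Scrapy shell, one low-effort way to try Splash is to point the shell at Splash's `render.html` HTTP endpoint instead of the site itself. The sketch below only builds that URL; it assumes a Splash instance is already running at `http://localhost:8050` (for example via the `scrapinghub/splash` Docker image), and the helper name `splash_render_url` and the 2-second wait are illustrative guesses rather than tuned values.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=2.0):
    """Build a Splash render.html URL that returns page_url after JS has run.

    `wait` is the number of seconds Splash waits after the page loads
    before returning the HTML; the right value depends on the site.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


print(splash_render_url("https://sailing-channels.com/by-subscribers"))
# http://localhost:8050/render.html?url=https%3A%2F%2Fsailing-channels.com%2Fby-subscribers&wait=2.0
```

Passing the printed URL (in quotes) to `scrapy shell` should give a `response` whose body is the rendered page, provided Splash is reachable. For a full spider, the scrapy-splash package integrates this through its own request class and middlewares.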

Upvotes: 2
