Reputation: 661
I'm trying to write a web scraper to collect some sales leads. The problem is that most modern websites use JavaScript to modify the DOM (usually React, Angular, or even just some jQuery). If I scrape a website with the request Node.js package and pass the HTML to cheerio, I'm simply not able to parse the code and get the info I want. Instead, all I can see are some React.js components ¯\_(ツ)_/¯
Any resources on this topic would be helpful. Thanks in advance.
Upvotes: 0
Views: 217
Reputation: 33216
That happens because the request package does not execute any of the JavaScript on the page. It just downloads the HTML as is. To see the page the way a browser does, you would have to write your own JavaScript interpreter that runs all of the page's scripts up to the state you want.
Luckily, there are some other options here:
You could open the developer tools on the website you want to scrape and look for the XHR request that fetches the data you need. Then you can call that URL directly.
You could use a headless browser such as PhantomJS, or CasperJS, which scripts PhantomJS. These tools load the page and run its included JavaScript, so the DOM you scrape is as close as possible to what a real browser would show.
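To illustrate the first option, here is a minimal sketch of calling a data endpoint directly instead of parsing the rendered HTML. Everything specific in it is an assumption: the URL, the `items` array, and the `name`/`email` fields are hypothetical placeholders for whatever the Network tab of the developer tools shows for your site. It uses the global `fetch` that ships with Node 18+ rather than the request package.

```javascript
// Sketch: call the JSON endpoint the page itself uses, instead of parsing HTML.
// ASSUMPTIONS: the URL, the `items` array, and the `name`/`email` fields are
// hypothetical; replace them with what the Network tab actually shows.
const API_URL = 'https://example.com/api/companies?page=1';

// Turn the (assumed) JSON payload into a flat list of leads.
// Entries without a name are dropped; a missing email becomes null.
function extractLeads(payload) {
  const items = Array.isArray(payload && payload.items) ? payload.items : [];
  return items
    .filter((item) => item && item.name)
    .map((item) => ({ name: item.name, email: item.email || null }));
}

// Fetch the endpoint and parse the response. `fetchImpl` defaults to the
// global fetch available in Node 18+; it is a parameter so it can be swapped.
async function fetchLeads(url = API_URL, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractLeads(await response.json());
}

// Usage (not run here, because the URL is only a placeholder):
// fetchLeads().then((leads) => console.log(leads)).catch(console.error);
```

Keeping `extractLeads` separate from the network call means the parsing logic can be checked against a saved response before hitting the real site. Some endpoints also expect cookies or extra headers, which you would need to copy from the request shown in the developer tools.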
Upvotes: 1