Reputation: 13843
I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.
P.S. This is the site I'm parsing. I want to get a list of current tracks and make my own html5 app for listen on mobile devices.
Upvotes: 29
Views: 27953
Reputation: 15502
I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.
Upvotes: 3
Reputation: 1776
If you want to go for phantom, use node-phantom. I have a git hub repository using them together to generate pdf files from html if you want to have a look. But i wouldn't go for phantom because it does more than what you usually want and cheerio is faster.
Upvotes: 0
Reputation: 39395
I have done this a lot. You'll want to use PhantomJS if the website that you're scraping is heavily using JavaScript. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches - this includes Zombie.js.
What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.
I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js But, again, if it's JavaScript intensive, use PhantomJS in conjunction with CasperJS.
Hope this helps.
Snippet using Request and Cheerio:
var request = require('request')
, cheerio = require('cheerio');
var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;
request(url, function(err, resp, body){
$ = cheerio.load(body);
links = $('.sb_tlst h3 a'); //use your CSS selector here
$(links).each(function(i, link){
console.log($(link).text() + ':\n ' + $(link).attr('href'));
});
});
Upvotes: 59
Reputation: 15042
You could try PhantomJS. Here's the documentation for using it for screen scraping.
Upvotes: 4