NiLL

Reputation: 13843

How to most efficiently parse a web page using Node.js

I need to parse a simple web page and get data from the HTML, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of current tracks and make my own HTML5 app for listening on mobile devices.

Upvotes: 29

Views: 27953

Answers (4)

Max Heiber

Reputation: 15502

I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.
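
A minimal Casper sketch along those lines (run with casperjs, not node; the URL and selector are just placeholders, and it assumes the getElementsInfo API from the 1.1 docs linked above):

// Run with `casperjs script.js`
var casper = require('casper').create();

casper.start('http://example.com/tracks', function () {
  // getElementsInfo returns text/attributes for every element matching the selector
  var links = this.getElementsInfo('.track a');
  links.forEach(function (link) {
    casper.echo(link.text + ': ' + link.attributes.href);
  });
});

casper.run();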

Upvotes: 3

Mustafa

Reputation: 1776

If you want to go the Phantom route, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go for Phantom unless you need it: it does more than you usually want, and Cheerio is faster.
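
For reference, a node-phantom session is wired up roughly like this (err-first callback style; the URL is a placeholder and the exact API can differ between versions):

var phantom = require('node-phantom');

phantom.create(function (err, ph) {
  ph.createPage(function (err, page) {
    page.open('http://example.com/', function (err, status) {
      // evaluate() runs inside the page context, so the DOM is available there
      page.evaluate(function () {
        return document.title;
      }, function (err, result) {
        console.log(result);
        ph.exit();
      });
    });
  });
});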

Upvotes: 0

JP Richardson

Reputation: 39395

I have done this a lot. You'll want to use PhantomJS if the website you're scraping uses JavaScript heavily. Note that PhantomJS is not Node.js. It's a completely different JavaScript runtime. You can integrate it through phantomjs-node or node-phantom, but they are both kinda hacky. YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches - this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js. But, again, if the page is JavaScript intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function(err, resp, body){
  if (err) throw err;
  var $ = cheerio.load(body);        // parse the returned HTML into a jQuery-like object
  var links = $('.sb_tlst h3 a');    // use your CSS selector here
  links.each(function(i, link){
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});

Upvotes: 59

jabclab

Reputation: 15042

You could try PhantomJS. Here's the documentation for using it for screen scraping.
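
A bare-bones PhantomJS scrape looks roughly like this (run it with phantomjs, not node; the URL and selector are placeholders):

var page = require('webpage').create();

page.open('http://example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }
  // evaluate() runs inside the page, so DOM APIs are available
  var srcs = page.evaluate(function () {
    return Array.prototype.map.call(
      document.querySelectorAll('img'),
      function (img) { return img.getAttribute('src'); });
  });
  console.log(srcs.join('\n'));
  phantom.exit();
});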

Upvotes: 4
