Mike

Reputation: 24954

What is the most elegant way to do screen scraping in node.js?

I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating:

  1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.

  2. Redirect following. I want each request to follow through redirects when a 302 status code is returned.

I came across two things which looked useful, but I couldn't use in the end:

Are there any JavaScript screenscraping-esque libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?

Upvotes: 15

Views: 9324

Answers (3)

Clint

Reputation: 2891

It turns out someone made a phantomjs module for node.js:

https://github.com/sgentle/phantomjs-node

While PhantomJS is fairly heavy, it supports SSL, cookies, and everything else a typical browser supports (it is a headless WebKit browser, after all).

Give it a shot, it may be exactly what you are looking for.

Upvotes: 3

mikeal

Reputation: 4057

I actually have a scraper library now at https://github.com/mikeal/spider and it's quite nice: you can use jQuery and routes.

Feedback is welcome :)

Upvotes: 4

RobertPitt

Reputation: 57258

You may want to check out https://github.com/mikeal/request from mikeal. I just spoke to him in the chatroom, and he says it does not handle cookies at the moment, but you can write a submodule to handle them in the meantime.
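
As a rough idea of what such a cookie helper could look like, here is a minimal sketch in plain JavaScript. The names `storeCookies` and `cookieHeader` are made up for illustration and are not part of request. It assumes you only need name/value pairs and deliberately ignores attributes such as Domain, Path, and Expires, so it is not a full cookie jar.

```javascript
// Hypothetical helper (not part of the request library).
// Merges the 'set-cookie' header array from a response into a plain
// object that acts as a tiny cookie jar.
function storeCookies(jar, setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    // The first "name=value" segment is the cookie itself; everything
    // after the first ';' is an attribute, which this sketch ignores.
    const pair = header.split(';')[0];
    const eq = pair.indexOf('=');
    if (eq < 1) continue; // skip malformed entries that have no name
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    jar[name] = value; // a later cookie with the same name overwrites
  }
  return jar;
}

// Builds the value for the outgoing 'Cookie' request header.
function cookieHeader(jar) {
  return Object.keys(jar)
    .map((name) => name + '=' + jar[name])
    .join('; ');
}
```

You would call `storeCookies(jar, response.headers['set-cookie'])` after each response and send `cookieHeader(jar)` as the `Cookie` header on the next request.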

As for redirects, it handles them beautifully :)
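
If you ever need to follow redirects by hand instead, the logic is roughly the following. This is only a sketch, not how request does it internally. `nextLocation` and `followRedirects` are hypothetical names, and `fetchOnce(url, callback)` is an assumed function that performs a single request (for example with Node's `http` or `https` module) and calls back with `(err, response)`. It relies on the global `URL` class from recent Node versions.

```javascript
// Hypothetical helper: returns the next URL to fetch, or null when the
// response is not a redirect. `response` is assumed to look like Node's
// http.IncomingMessage (a statusCode plus lowercased header names).
function nextLocation(currentUrl, response) {
  const redirectCodes = [301, 302, 303, 307, 308];
  const location = response.headers.location;
  if (!redirectCodes.includes(response.statusCode) || !location) {
    return null;
  }
  // Location may be relative, so resolve it against the URL just fetched.
  return new URL(location, currentUrl).href;
}

// Follows redirects until a non-redirect response arrives or the
// redirect budget is used up. `fetchOnce` is an assumption, see above.
function followRedirects(url, fetchOnce, maxRedirects, callback) {
  fetchOnce(url, (err, response) => {
    if (err) return callback(err);
    const next = nextLocation(url, response);
    if (!next) return callback(null, response, url);
    if (maxRedirects <= 0) return callback(new Error('Too many redirects'));
    followRedirects(next, fetchOnce, maxRedirects - 1, callback);
  });
}
```

The redirect limit guards against redirect loops. Since `fetchOnce` picks the transport, it can switch between `http` and `https` based on the URL's protocol.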

Upvotes: 3
