Dan Lew

Reputation: 87430

Are there command line or library tools for rendering webpages that use JavaScript?

Page-scraping on the Internet seems to have hit something of a wall for me, as more and more sites depend on JavaScript to render portions of the page.

It seems to me that with so many open source layout and JavaScript renderers released (like WebKit, Gecko and Chromium + V8) that someone must have made a tool for downloading a page and rendering its JavaScript without having to run an actual browser. However, I'm not turning up what I'm looking for with my searches - I've found tools like Selenium-rc, but they depend on a running browser. I'm interested in any tool or library which can do one (or both) of the following:

  1. A program that can be run from the command line (*nix) which, given the source of a page, returns the page's source as rendered by some JS engine.

  2. Integrated support in a particular language that allows one to (easily) pass the source of a page to it and returns the page's source as rendered by some JS engine.

I think #1 is preferable in a general sense, but #2 would be more useful if the tool exists in the language I want to work in. Also, I'm not concerned with the particular JS engine - any relatively modern one will do. What is out there?

Upvotes: 18

Views: 5924

Answers (8)

h4ck3rm1k3

Reputation: 2100

WebKit HTML to PDF (wkhtmltopdf) works perfectly; it can even produce JPG images.

http://wkhtmltopdf.googlecode.com
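
A minimal sketch of driving the tool from a script, assuming the `wkhtmltopdf` and `wkhtmltoimage` binaries are installed and on `PATH`. The helper names `build_command` and `render` are hypothetical, and the 2000 ms delay is an arbitrary choice. Note that the output is a PDF or an image, not the rendered page source.

```python
import os
import shutil
import subprocess


def build_command(url, output_path, js_delay_ms=2000):
    """Return the argv for rendering `url` to `output_path`.

    A .pdf path selects wkhtmltopdf; any other extension is treated as an
    image format and selects the companion wkhtmltoimage binary.
    """
    extension = os.path.splitext(output_path)[1].lstrip(".").lower()
    # --javascript-delay gives page scripts time to run before capture.
    delay = ["--javascript-delay", str(js_delay_ms)]
    if extension == "pdf":
        return ["wkhtmltopdf", *delay, url, output_path]
    return ["wkhtmltoimage", "--format", extension, *delay, url, output_path]


def render(url, output_path, js_delay_ms=2000):
    """Run the command if the binary is installed; raise otherwise."""
    cmd = build_command(url, output_path, js_delay_ms)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    subprocess.run(cmd, check=True)
    return output_path
```

Calling `render("http://example.com", "page.jpg")` would then write a JPG snapshot of the page after its scripts have had two seconds to run.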

Upvotes: 4

Ben Combee

Reputation: 17427

Since JavaScript can do quite a lot of manipulation of the web page's document object model (DOM), it seems that to accurately scrape the content of an arbitrary page, you'd need not only a JavaScript engine but also a complete and accurate DOM representation of the page. That's something you'll only get if you have a real browser engine instantiated. It is possible to use an embedded, non-displayed WebKit or Gecko engine for this: after a suitable loading delay to allow for script execution, just dump the DOM contents in HTML form.
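
A minimal sketch of the "load, wait, dump the DOM" idea, using a headless Chromium-based browser as a stand-in for the embedded engine (a swapped-in technique, not one named in this answer). It assumes such a browser is installed; the binary name `chromium` varies by platform, and the helper names and the 5000 ms wait are hypothetical choices.

```python
import subprocess


def dump_dom_command(url, browser="chromium", wait_ms=5000):
    """Build the argv that loads `url` without a window and prints its DOM."""
    return [
        browser,
        "--headless",                          # run without displaying a window
        "--dump-dom",                          # print the serialized DOM to stdout
        f"--virtual-time-budget={wait_ms}",    # let page scripts run before dumping
        url,
    ]


def dump_dom(url, browser="chromium", wait_ms=5000, timeout_s=60):
    """Run the browser and return the rendered HTML as a string."""
    result = subprocess.run(
        dump_dom_command(url, browser, wait_ms),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return result.stdout
```

The returned string is the DOM as it stands after script execution, so it can be fed to an ordinary HTML parser.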

Upvotes: 2

Tobias

Reputation: 4292

It takes very little code to have a WebView render a page without displaying anything, but it has to be a GUI application. Such an application can still take command-line arguments and hide its window. Using WebKit directly, it might be possible to do this in a command-line tool.

Apart from the complicated DOM access in Objective-C, WebKit can also inject JavaScript, and together with jQuery that makes for a nice scraping solution. I don't know of any universal application that does this, though.

Upvotes: 0

Javier

Reputation: 62583

I think there's example code for Qt that uses the included WebKit to render a page to a pixmap. Getting from there to a full CLI utility is just a matter of defining your needs.

Of course, for most screen-scraping needs you want the text, not a pixmap. If that's what you want, better check out Rhino.

Upvotes: 1

Brian Campbell

Reputation: 332816

Well, there's the DumpRenderTree tool which is used as part of the WebKit test suites. I'm not sure how suitable it is for turning into a standalone tool, but it does what you ask for (render HTML, run JavaScript, and dump its render tree out to disk).

Upvotes: 2

Seb

Reputation: 25147

We used Rhino some time ago to do some automated testing from Java. It seems it'll do the job for you :)

Upvotes: 1

Serg

Reputation: 2946

You can look at HTMLUnit. Its main purpose is automated web testing, but I think it may let you get the rendered page.

Upvotes: 2

David

Reputation: 3227

There is the Cobra Engine for Java (http://lobobrowser.org/cobra.jsp), which handles JavaScript (it also has a renderer, but that is optional). I've never used it, but I have heard nice things about it.

Upvotes: 0
