Trowalts

Reputation: 119

Javascript to scrape its own instance of a page

My JavaScript-fu is a bit weak. Is it possible to use JavaScript to scrape the page it's on into a string? I don't want it to make another request for the webpage. I need it to read in itself and any other source sitting on the page, which will include a unique token generated for each page request, hence the need for it to read in all data on that instance of the page.

It also needs to be everything on that page, including comments, as I would like to create an MD5 hash from it. Is this possible at all?

The HTML that needs to be scraped is the markup representing the DOM right after the page completes its initial load.

Upvotes: 0

Views: 105

Answers (2)

SeanCannon

Reputation: 77976

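// Serializes the current DOM from <html> to </html>, comments inside it included; the doctype is not part of the result.
var myHTML = document.documentElement.outerHTML;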

Demo, which also shows that Marc B's idea does not give the desired result: http://jsfiddle.net/AlienWebguy/hu2Mj/
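Since the question asks for everything on the page, here is a rough sketch of one way to also capture the doctype, which `outerHTML` on the root element leaves out. The helper names `doctypeToString` and `snapshotDocument` are made up for illustration, and the function takes the document as a parameter only so it can be tried outside a browser. Comments that sit outside the `<html>` element are still not captured by this approach.

```javascript
// Build a "<!DOCTYPE ...>" string from a DocumentType node.
// Returns "" when the document has no doctype.
function doctypeToString(doctype) {
  if (!doctype) return "";
  let s = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    s += ' PUBLIC "' + doctype.publicId + '"';
  } else if (doctype.systemId) {
    s += " SYSTEM";
  }
  if (doctype.systemId) {
    s += ' "' + doctype.systemId + '"';
  }
  return s + ">";
}

// Hypothetical helper: serialize the doctype (if any) followed by the
// root element as the browser currently represents it.
function snapshotDocument(doc) {
  const prefix = doctypeToString(doc.doctype);
  return (prefix ? prefix + "\n" : "") + doc.documentElement.outerHTML;
}

// In a browser, take the snapshot once the page has finished loading.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", function () {
    const html = snapshotDocument(document);
    console.log(html.length + " characters captured");
  });
}
```

The string is a serialization of the live DOM at the moment of the call, so any script that modifies the page before the `load` handler runs will be reflected in it.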

Upvotes: 1

jfriend00

Reputation: 707406

Be careful with this. With JavaScript, you have access to all the objects on the page and you can fetch the HTML for the entire page. But the HTML you fetch with JavaScript may or may not be exactly the HTML that came from the original page download. Some browsers (like older versions of IE) don't actually store the original HTML, so when you ask for innerHTML they manufacture HTML from the objects on the page. When they do that, attributes may come out in a different order, quoting may differ, and even the capitalization of attribute names may change.

So, if you really need an MD5 hash of the original HTML page and need it to be accurate, you will have to request the page again from the server (the response will probably end up coming from the browser cache) and calculate your own MD5 hash of what you download. You can't use the innerHTML of the current document for that.
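A minimal sketch of the re-request approach, under a few assumptions. The browser's built-in Web Crypto `digest` does not offer MD5, so this uses SHA-256 as a stand-in; a real MD5 would need a third-party library. The function name `hashOriginalSource` and its extra `fetchImpl` and `subtle` parameters are hypothetical, added only so the function can be exercised with stand-ins. `cache: "force-cache"` asks the browser to reuse a cached copy if it has one. If the page is not cacheable, a fresh request goes out and any per-request token in the response will differ from the one in the page you are looking at.

```javascript
// Convert an ArrayBuffer of digest bytes to a lowercase hex string.
function toHex(buffer) {
  return Array.from(new Uint8Array(buffer))
    .map(function (b) { return b.toString(16).padStart(2, "0"); })
    .join("");
}

// Hypothetical helper: re-request `url`, preferring the HTTP cache,
// and hash the raw response bytes.
// fetchImpl and subtle fall back to the browser globals when omitted.
async function hashOriginalSource(url, fetchImpl, subtle) {
  fetchImpl = fetchImpl || fetch;
  subtle = subtle || crypto.subtle;

  const response = await fetchImpl(url, { cache: "force-cache" });
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }

  // Hash the bytes as delivered, not a re-serialized DOM.
  const bytes = await response.arrayBuffer();
  const digest = await subtle.digest("SHA-256", bytes); // SHA-256, not MD5
  return toHex(digest);
}

// Browser usage:
// hashOriginalSource(location.href).then(function (hex) { console.log(hex); });
```

Hashing the response bytes sidesteps the attribute-order and quoting differences described above, at the cost of depending on what the cache or server returns for the second request.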

Upvotes: 1
