Javascript to scrape its own instance of a page

Question

My JavaScript-fu is a bit weak. Is it possible to use JavaScript to scrape the page it's on into a string? I don't want it to make another request for the page. It needs to read in itself and any other source sitting on the page, which will include a unique token generated for each page request, hence the need to read all the data on that particular instance of the page.

It also needs to be everything on that page, including comments, as I would like to create an MD5 hash from it. Is this possible at all?

The HTML that needs to be scraped is the HTML representing the DOM after the page initially completes loading.

jfriend00 · Accepted Answer

Be careful with this. With JavaScript, you have access to all the objects of the page and you can fetch the HTML for the entire page. But the HTML you fetch with JavaScript may or may not be exactly the same HTML that came from the original page download. Some browsers (like older versions of IE) don't actually store the original HTML, so when you ask for the innerHTML, they manufacture HTML from the objects on the page. When they do that, attributes may be in a different order, quoting may be different, and even the capitalization of attribute names may be different.
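If the browser's own serialization of the DOM is good enough for your purposes, a minimal sketch might look like the following. The function names `serializeDocument` and `doctypeToString` are made up for illustration. It walks the top-level nodes of the document so that a doctype and any comments outside the `<html>` element are included; comments inside the element already appear in `outerHTML`. The result is still HTML regenerated by the browser, not the bytes the server sent.

```javascript
// Rebuild a doctype declaration from a DocumentType node.
function doctypeToString(dt) {
  let s = "<!DOCTYPE " + dt.name;
  if (dt.publicId) s += ' PUBLIC "' + dt.publicId + '"';
  if (dt.systemId) s += (dt.publicId ? "" : " SYSTEM") + ' "' + dt.systemId + '"';
  return s + ">";
}

// Serialize the live DOM of `doc` into a string.
// Node types: 10 = doctype, 8 = comment, 1 = element.
function serializeDocument(doc) {
  let html = "";
  for (const node of doc.childNodes) {
    if (node.nodeType === 10) html += doctypeToString(node);
    else if (node.nodeType === 8) html += "<!--" + node.data + "-->";
    else if (node.nodeType === 1) html += node.outerHTML;
  }
  return html;
}

// In a browser, take the snapshot once the page has finished loading.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    const snapshot = serializeDocument(document);
    console.log(snapshot.length + " characters serialized");
  });
}
```

Two browsers can produce different strings from this for the same page, so a hash of the result is only comparable against hashes computed the same way in the same browser.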

So, if you really need an MD5 hash of the original HTML page and need it to be accurate, you will have to request the page again from the server (it will probably end up coming from the browser cache) and calculate your own MD5 hash of what you download from that - you can't use the innerHTML of the current document.
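A rough sketch of the re-request approach is below, with some assumptions spelled out. The browser's built-in Web Crypto API does not offer MD5, so this example swaps in SHA-256; for an actual MD5 digest you would need a separate MD5 implementation. The function names `toHex` and `fetchAndHash` are made up for illustration. The `cache: "force-cache"` option asks the browser to prefer a cached copy, but it does not guarantee one, so if the server generates a new token per request the hashed bytes may differ from the page you are looking at.

```javascript
// Convert an ArrayBuffer to a lowercase hex string.
function toHex(buffer) {
  return Array.from(new Uint8Array(buffer), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// Re-request `url` and hash the raw response bytes.
// Assumption: SHA-256 stands in for MD5, which Web Crypto does not provide.
async function fetchAndHash(url) {
  // Prefer the HTTP cache; falls back to the network if nothing is cached.
  const response = await fetch(url, { cache: "force-cache" });
  if (!response.ok) {
    throw new Error("Request failed: " + response.status);
  }
  const bytes = await response.arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return toHex(digest);
}

// Usage in a browser:
// fetchAndHash(location.href).then((hash) => console.log(hash));
```

Note that `crypto.subtle` is only available in secure contexts (HTTPS or localhost).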
