Reputation: 23
I'm interested in writing a script, preferably one that is easy to add to browsers with tools such as Greasemonkey, that sends a page's HTML source code to an external server, where it will later be parsed and the useful data sent to a database.
However, I haven't seen anything like that and I'm not sure how to approach this task. I would imagine some sort of HTTP POST would be the best approach, but I'm completely new to those ideas, and I'm not even sure exactly where to send the data to parse it (it doesn't make sense to send an entire HTML document to a database, for instance).
So basically, my overall goal is something that works like this (note that I only need help with steps 1 and 2. I am familiar with data parsing techniques, I've just never applied them to the web):
Any tips or help is greatly appreciated, thank you!
Edit: Code
ihtml = document.body.innerHTML;
GM_xmlhttpRequest({
method:'POST',
url:'http://www.myURL.com/getData.php',
data:"SomeData=" + escape(ihtml)
});
Edit: Current JS Log:
Namespace/GMScriptName: Server Response: 200
OK
4
Date: Sun, 19 Dec 2010 02:41:55 GMT
Server: Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
Array
(
)
http://www.url.com/getData.php
Upvotes: 2
Views: 1335
Reputation: 26766
As mentioned in the comment on your Q, I'm not convinced this is a good idea and personally, I'd avoid any extension that did this like the plague but...
You can use the innerHTML property available on all HTML elements to get the HTML inside that node, e.g. the body element. You could then use an AJAX HTTP(S!) request to post the data.
You might also want to consider some form of compression as some pages can be very large and most users have better download speeds than upload speeds.
NB: innerHTML gets a representation of the source code that would display the page in its current state, NOT the actual source that was sent from the web server. E.g. if you used JS to add an element, the source for that element would be included in innerHTML even though it was never sent across the web.
An alternative would be to use an AJAX request to GET the current URL and send yourself the response. This would be exactly what was sent to the client, but the server in question will see the page requested twice (and in some web applications that may cause problems, e.g. by "pressing" a delete button twice).
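A rough sketch of that re-fetch approach, in case it helps. The function name forwardServedSource is made up for illustration, the collector URL and the PageContents field name are placeholders, and the request function is passed in as a parameter so the same code can be tried outside Greasemonkey. It only relies on the method, url, data, headers and onload options of GM_xmlhttpRequest and on responseText in the response.

```javascript
// Sketch: GET the page again, then POST the raw response body to a collector.
// `request` is expected to behave like GM_xmlhttpRequest.
// `collectorUrl` and the PageContents field name are placeholders.
function forwardServedSource(request, pageUrl, collectorUrl) {
  request({
    method: 'GET',
    url: pageUrl,
    onload: function (response) {
      // responseText is the markup as served, before any script changed the DOM
      request({
        method: 'POST',
        url: collectorUrl,
        data: 'PageContents=' + encodeURIComponent(response.responseText),
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
      });
    }
  });
}

// In a userscript this might be called as:
// forwardServedSource(GM_xmlhttpRequest, location.href, 'http://example.com/getData.php');
```

Bear in mind the caveat above: the second GET is a real request, so this is only reasonable for pages where fetching twice is harmless.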
One final suggestion would be to simply send the current URL to yourself and do the download on your own servers. This would also mitigate some of the security risks, as you wouldn't be able to retrieve the content of pages which aren't public.
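For the URL-only variant, the script shrinks to building a tiny form body. A minimal sketch, assuming a hypothetical PageUrl field name that your server-side script would read (the helper name buildUrlReport is also just for illustration):

```javascript
// Sketch: report only the page address; the server downloads the page itself.
// The PageUrl field name is a placeholder, not something the server requires.
function buildUrlReport(pageUrl) {
  return 'PageUrl=' + encodeURIComponent(pageUrl);
}

// In a userscript this might be sent as:
// GM_xmlhttpRequest({
//   method: 'POST',
//   url: 'http://example.com/getData.php',
//   data: buildUrlReport(location.href),
//   headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
// });
```

The upload is a few dozen bytes instead of the whole document, which also sidesteps the upload-speed concern.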
EDIT:
NB: I've deleted much spurious information which was used in tracking down the problem, check the edit logs if you want full details
PHP Code:
<?php
$PageContents = $_POST['PageContents'];
?>
GreaseMonkey script:
var ihtml = document.body.innerHTML;
GM_xmlhttpRequest({
method:'POST',
url:'http://example.com/getData.php',
data:"PageContents=" + encodeURIComponent(ihtml),
headers: {'Content-type': 'application/x-www-form-urlencoded'}
});
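A note on encoding the body, with a small sketch. With the application/x-www-form-urlencoded content type, PHP decodes a literal + in the body as a space, so the value needs to be percent-encoded by something that also encodes +. encodeURIComponent does; escape leaves + untouched. The helper name toFormBody is made up for illustration:

```javascript
// Sketch: build an application/x-www-form-urlencoded body from a plain object.
// Every name and value is percent-encoded, so &, = and + inside the HTML
// cannot be mistaken for field separators or spaces on the server.
function toFormBody(fields) {
  return Object.keys(fields).map(function (name) {
    return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
  }).join('&');
}

// In a userscript this might be used as:
// data: toFormBody({ PageContents: document.body.innerHTML })
```

If the server prints an empty $_POST array, a missing Content-type header is the first thing to check, since without it PHP does not parse the body into $_POST.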
Upvotes: 3