Reputation: 2018
There are lots of web pages that simply run a script without having any material on them. Is there any way of seeing the page source without actually visiting the page, because it just redirects you?
Will using an HTML parser work for this? I'm using simpleHTMLdom to parse the page.
Upvotes: 1
Views: 9787
Reputation: 318508
In Firefox you can use the view-source protocol to view only the source code of a site without actually rendering it or executing any JavaScript on it.
Example: view-source:http://stackoverflow.com/q/5781021/298479 (copy it to your address bar)
Upvotes: 7
Reputation: 8528
If you need a quick and dirty fix, you could disable JavaScript and meta redirects (Internet Explorer can disable these in the Internet Options dialog; Firefox can use the NoScript add-on for the same effect).
This won't stop any server-side redirects, but it will prevent client-side redirects and allow you to see the document's HTML source.
Upvotes: 1
Reputation: 63588
If you are trying to HTML-scrape the contents of a page that builds 90%+ of its content/view by executing JavaScript, you are going to encounter issues unless you render it to a (hidden) screen and then scrape that. Otherwise you'll end up scraping a few script tags, which does you little good.
e.g. if I try to scrape my Gmail inbox page, it is an empty HTML page with just a few scattered script tags (likely typical of almost all GWT-based apps).
Does the page/site you are scraping have an API? If not, is it worth asking them if they have one in the works?
Typically these types of tools walk a fine line between "stealing" information and "sharing" information, so you may need to tread lightly.
Upvotes: 0
Reputation: 11
wget or lynx will also work well if you have access to a Linux command-line shell:
wget http://myurl
lynx -dump http://myurl
Upvotes: 0
Reputation: 1430
If you're on a *nix-based operating system, try using curl from the terminal.
Upvotes: 0
Reputation: 13522
The only way to get the page's HTML source is to send an HTTP request to the web server and receive the answer, which is equivalent to visiting the page.
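As a sketch of that idea in Python (the local test server and URL here are just stand-ins for a real site): a plain HTTP fetch returns the raw source, and neither the meta refresh nor the script redirect is acted on, because nothing interprets them.

```python
# Sketch, assuming a page whose only "content" is a client-side redirect.
# A tiny local server stands in for the real site; urllib performs the
# HTTP request but never executes JavaScript or follows meta refreshes.
import http.server
import threading
import urllib.request

PAGE = (b"<html><head>"
        b'<meta http-equiv="refresh" content="0;url=http://example.com/">'
        b"<script>window.location = 'http://example.com/';</script>"
        b"</head><body>hidden content</body></html>")

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)
    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetch the raw source: both client-side redirects are visible as markup.
url = "http://127.0.0.1:%d/" % server.server_address[1]
source = urllib.request.urlopen(url).read().decode()
print("meta refresh present:", "http-equiv" in source)
print("script tag present:", "<script>" in source)

server.shutdown()
```

Anything that speaks HTTP (wget, curl, or a library like this) sees the same untouched markup a browser would download before it starts executing scripts.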
Upvotes: 0
Reputation: 11538
Yes, simply parsing the HTML will get you the client-side (JavaScript) code.
When these pages are accessed through a browser, the browser runs the code and follows the redirect, but when you access the page with a scraper or your own program, the code is not run and the static script can be obtained.
Of course you can't access the server-side code (PHP); that's impossible.
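To sketch the point with Python's built-in html.parser (standing in for simpleHTMLdom; the sample HTML is hardcoded for illustration): the parser hands back the redirect script as inert text and never executes it.

```python
# Sketch: an HTML parser treats client-side JavaScript as plain text.
# The raw HTML is hardcoded here; in practice it would come from a fetch.
from html.parser import HTMLParser

RAW = """<html><head>
<script>window.location = "http://example.com/";</script>
</head><body></body></html>"""

class ScriptCollector(HTMLParser):
    """Collects the text inside <script> tags without running any of it."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data.strip())

collector = ScriptCollector()
collector.feed(RAW)
print(collector.scripts)  # the redirect code, captured as static text
```

The redirect never fires because a parser only tokenizes the markup; running the script would take a JavaScript engine, which is exactly what a scraper lacks.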
Upvotes: 1