user434885

Reputation: 2018

Getting a webpage's source code without actually accessing the page

There are lots of web pages which simply run a script without having any material on them. Is there any way of seeing the page source without actually visiting the page, since it just redirects you?

Would using an HTML parser work for this? I'm using Simple HTML DOM to parse the page.
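
For reference, here is roughly what I am doing (a minimal sketch; file_get_html is Simple HTML DOM's loader, it fetches the raw markup over HTTP without executing any JavaScript, and http://example.com/ is only a placeholder):

<?php
// Simple HTML DOM fetches the raw markup over plain HTTP; no JavaScript
// runs, so a client-side redirect in the page never fires.
include 'simple_html_dom.php';

$html = file_get_html('http://example.com/'); // placeholder URL
if ($html === false) {
    die("Failed to fetch the page\n");
}

// Dump every <script> tag so the redirect code itself is visible.
foreach ($html->find('script') as $script) {
    echo $script->outertext, "\n";
}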

Upvotes: 1

Views: 9787

Answers (7)

ThiefMaster

Reputation: 318508

In Firefox you can use the view-source protocol to view only the source code of a site without actually rendering it or executing any JavaScript on it.

Example: view-source:http://stackoverflow.com/q/5781021/298479 (copy it to your address bar)

Upvotes: 7

Farray

Reputation: 8528

If you need a quick & dirty fix, you could disable JavaScript and meta redirects. (Internet Explorer can disable these in the Internet Options dialog; Firefox can use the NoScript add-on for the same effect.)

This won't stop any server-side redirects, but it will prevent client-side redirects and let you see the document's HTML source.

Upvotes: 1

scunliffe

Reputation: 63588

If you are trying to HTML-scrape the contents of a page that builds 90%+ of its content/view by executing JavaScript, you are going to encounter issues unless you render it to a (hidden) screen and then scrape that. Otherwise you'll end up scraping a few script tags, which does you little good.

e.g. if I try to scrape my Gmail inbox page, it is an empty HTML page with just a few scattered script tags (likely typical of almost all GWT-based apps).

Does the page/site you are scraping have an API? If not, is it worth asking them if they have one in the works?

Typically these types of tools run along a fine line between "stealing" information and "sharing" information, so you may need to tread lightly.

Upvotes: 0

user724081

Reputation: 11

wget or lynx will also work well if you have access to a command-line Linux shell:

wget http://myurl
lynx -dump http://myurl

Upvotes: 0

JSager

Reputation: 1430

If you're on a *nix-based operating system, try using curl from the terminal.

curl http://www.google.com

Upvotes: 0

Alex Netkachov

Reputation: 13522

The only way to get a page's HTML source is to send an HTTP request to the web server and receive the answer, which is effectively the same as visiting the page.
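
To illustrate (a minimal sketch; file_get_contents with a URL performs a plain HTTP GET, assuming allow_url_fopen is enabled, and http://example.com/ is only a placeholder):

<?php
// A plain HTTP GET: the server sees an ordinary request, but no browser
// is involved, so none of the returned JavaScript ever executes.
$source = file_get_contents('http://example.com/'); // placeholder URL
if ($source === false) {
    die("Request failed\n");
}
echo $source; // the untouched HTML, redirect scripts included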

Upvotes: 0

neeebzz

Reputation: 11538

Yes, simply parsing the HTML will get you the client-side (JavaScript) code.

When these pages are accessed through a browser, the browser runs the code and follows the redirect; but when you access them with a scraper or your own program, the code is not run and the static script can be obtained.
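
For example, once you have the static source you can scan it for the usual client-side redirect patterns (a minimal sketch; the regular expressions only cover common meta-refresh and window.location forms, and http://example.com/ is only a placeholder):

<?php
// Fetch the page without executing it, then scan the static source for
// the two most common client-side redirect patterns.
$source = file_get_contents('http://example.com/'); // placeholder URL

// <meta http-equiv="refresh" content="0;url=http://target/">
if (preg_match('/http-equiv=["\']refresh["\'][^>]*url=([^"\'>]+)/i', $source, $m)) {
    echo "Meta redirect to: {$m[1]}\n";
}

// window.location = '...'; or location.href = '...';
if (preg_match('/(?:window\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']/i', $source, $m)) {
    echo "JavaScript redirect to: {$m[1]}\n";
}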

Of course you can't access the server-side (PHP) code. That's impossible.

Upvotes: 1
