Ben Packard

Reputation: 26476

Access elements on an external page

I have an HTML page that is reached via a link which passes the URL of an external page as a query parameter, e.g.

http://www.mydomain.com/mypage?external-page=encodedURL

It is the responsibility of my page to scrape some data from the URL it is handed.

How can I access the passed-in page using JavaScript/jQuery? I need to be able to pull out the content for certain classes and ids.

Is this a violation of the same-origin policy? If so, is there some other way to process an external page like this? It seems strange to me that I can hit the web page in a browser or with a terminal command and receive the content, but not from a JS file.

Upvotes: 0

Views: 1799

Answers (2)

Cris Stringfellow

Reputation: 3808

You can use a browser extension to scrape the external page, then send the data to your site, OR display it within the page, so that it can then be accessed by your page's javascript via the DOM.

You can use a proxy on your domain which fetches the external page and hands it to your javascript whose origin is on your domain, too.

You can use an API for the external page which is accessible.

You can ask, or (if you have access to it) change the code of, the external page to serve its pages with the response header Access-Control-Allow-Origin: *

I think this is all you can do.
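As a rough sketch of the proxy option, the page's script might look something like the following. The `/proxy?url=...` endpoint is hypothetical (you would have to write it on your own server so that it fetches the external page and returns its HTML), and the function names are made up for illustration. The `scrape` function is browser-only because it relies on `fetch` and `DOMParser`.

```javascript
// Read the encoded external URL from this page's own address.
// searchParams.get() already percent-decodes the value.
function externalPageFrom(locationHref) {
  return new URL(locationHref).searchParams.get("external-page");
}

// Build the address of a HYPOTHETICAL same-origin proxy endpoint.
function proxyUrlFor(target) {
  return "/proxy?url=" + encodeURIComponent(target);
}

// Browser-only: fetch the page through the proxy and pull out elements.
async function scrape(locationHref, selectors) {
  const target = externalPageFrom(locationHref);
  const response = await fetch(proxyUrlFor(target));
  const html = await response.text();
  // Parse into a detached document so the external markup is not rendered.
  const doc = new DOMParser().parseFromString(html, "text/html");
  return selectors.map((selector) => {
    const el = doc.querySelector(selector);
    return el ? el.textContent : null;
  });
}
```

In the page you would call something like `scrape(window.location.href, ["#title", ".price"])`; because the request goes to your own domain, the same-origin policy does not get in the way.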

EDIT: The "seems strange" part goes away once you realize the intended difference between a user and a process. The user is not thought to be malicious, but a process could be. A process could, for example, grab data from a user's logged-in Gmail session if it had access to the external page, and transmit that data to a server. Since the user at the terminal is probably (but not always!) the one who logged in to that session, the user is not thought to be malicious. But a script whose origin is some website the user navigates to should not be able to act with the same permissions as that user: the script is an agent as well and can take actions, but it is not created or directed by the user. That's the strongest reason for the isolation of origins and the same-origin policy.
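To make "origin" concrete: two URLs share an origin only when scheme, host and port all match. A small check along these lines (the function name `sameOrigin` is just for illustration) shows which pairs a browser would treat as the same origin:

```javascript
// Two URLs are same-origin when scheme, host and port all match.
// URL#origin normalizes default ports, so :80 on http is dropped.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

console.log(sameOrigin("http://www.mydomain.com/mypage", "http://www.mydomain.com/other"));
console.log(sameOrigin("http://www.mydomain.com/mypage", "https://www.mydomain.com/mypage"));
console.log(sameOrigin("http://www.mydomain.com/mypage", "http://example.org/"));
```

Only the first pair counts as same-origin; a different scheme or a different host is enough to make the second and third cross-origin.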

Example

Execution Context of Bookmarklets and IFrames

If you are injecting JS into every page via a bookmarklet, the injected code behaves as if it has the same origin as the rest of the page, or at least as the "top frame" of that page. It executes in the same context as the top frame. If there are nested iframes in the page, you will get an "unsafe attempt to access page x from " error if your bookmarklet tries to inject into them. This is because the bookmarklet has its origin in the top page, and the top page can never access nested iframes on different domains anyway.

So if some part of the site you wish to scrape is in an iframe below the top frame, your bookmarklet will fail to get it.
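As an illustration, the collecting part of such a bookmarklet could be as small as the sketch below. The function name and the selectors `.price` and `#title` are placeholders, not anything from the real page. It only looks at the document it is given, so content inside cross-origin iframes is not reachable this way.

```javascript
// Collect the trimmed text of every element matching each selector.
// `doc` is the page's `document` when run from a bookmarklet.
function collectText(doc, selectors) {
  const found = {};
  for (const selector of selectors) {
    found[selector] = Array.from(
      doc.querySelectorAll(selector),
      (el) => el.textContent.trim()
    );
  }
  return found;
}

// Only runs where a DOM exists (i.e. in the browser).
// The selectors here are placeholders for the classes and ids you need.
if (typeof document !== "undefined") {
  console.log(collectText(document, [".price", "#title"]));
}
```

To turn this into a bookmarklet, the code would be wrapped in `javascript:(function(){ ... })();` and saved as the bookmark's address.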

Transmitting Data using a bookmarklet

If you want to take a URL on one page on your domain, grab data from that URL on another domain, and then display that data back on the original page, you need a way to get the data across. You could use a bookmarklet, but the flow would still involve some "user help". It would go something like this:

  1. Load your domain's page, D. User puts a url into an input box. Clicks submit.
  2. Javascript on D opens a new tab/window pointing to the user provided url.
  3. User clicks your scraping bookmarklet on that external page, which collects the desired data, X.
  4. Desired data, X, is sent via Ajax to a "server", S, with session identifier I.
  5. Page D, polls the server S, until it gets notified that some data with session identifier I has been grabbed, then it gets that data and displays it on D.
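The bookkeeping for steps 4 and 5 can be very small. The sketch below assumes an in-memory store on server S keyed by the session identifier, plus a polling helper for page D. `ScrapeStore`, `pollForResult` and the `fetchResult` callback are hypothetical names; how the store is exposed over HTTP is left out.

```javascript
// Server side (S): hold scraped data until page D asks for it.
class ScrapeStore {
  constructor() {
    this.results = new Map(); // session identifier I -> scraped data X
  }
  // Step 4: the bookmarklet's Ajax request ends up here.
  save(sessionId, data) {
    this.results.set(sessionId, data);
  }
  // Step 5: returns the data once and forgets it, or null if not ready.
  take(sessionId) {
    if (!this.results.has(sessionId)) return null;
    const data = this.results.get(sessionId);
    this.results.delete(sessionId);
    return data;
  }
}

// Client side (D): keep asking until data arrives or we give up.
// `fetchResult` is whatever function performs the request to S.
async function pollForResult(fetchResult, sessionId, delayMs = 1000, maxTries = 30) {
  for (let attempt = 0; attempt < maxTries; attempt++) {
    const data = await fetchResult(sessionId);
    if (data !== null) return data;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return null;
}
```

Returning the data only once keeps stale results from leaking into a later request that happens to reuse the same identifier.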

A server is needed for this. You can't use local storage to transmit the information, since local storage is specific to a domain. There is an alternative that does not require a server, but it requires making a browser extension.

Transmitting data using a browser extension

The "background page" of the extension is basically the same as a local server for all the browser tabs: it permits transmitting information across tabs targeted to different domains. The "clients" in this setup are the "content scripts", which are loaded into every page (just like a bookmarklet, except that the user does not have to click anything to load them; it happens automatically). The flow would go like this:

  1. Page D again. User inputs url in input box. Clicks submit -> which triggers some code in the extension.
  2. The extension background page instructs a tab to open and targets it to the url.
  3. A content script loads automatically into that tab, checks with the background what data it should get. It gets that data, and sends it, via a message (a json string) to the background page.
  4. The background page pushes that notification and the data on to the original content script on page D, which displays the information.
  5. Optionally, the background page also transmits the information to your server for saving into that user's datastore.
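A minimal sketch of the background-page side of steps 1 to 4, assuming Chrome's extension messaging API. The message types `scrape-request`, `scrape-result` and `scrape-data` are invented for this example, and the content scripts that send and receive them are not shown. The `chrome` object is passed in as `api` so the routing logic can be tried outside an extension.

```javascript
// Relay between the content script on page D and the one on the
// scraped tab. Inside an extension, `api` is the global `chrome`.
function createRelay(api) {
  const pending = new Map(); // scraped tab id -> tab id of page D

  api.runtime.onMessage.addListener((message, sender) => {
    const fromTab = sender.tab && sender.tab.id;
    if (message.type === "scrape-request") {
      // Steps 1-2: page D asks for a URL; open it in a new tab.
      api.tabs.create({ url: message.url }, (tab) => {
        pending.set(tab.id, fromTab);
      });
    } else if (message.type === "scrape-result" && pending.has(fromTab)) {
      // Steps 3-4: forward the scraped data back to page D.
      api.tabs.sendMessage(pending.get(fromTab), {
        type: "scrape-data",
        data: message.data,
      });
      pending.delete(fromTab);
    }
  });

  return pending;
}

// Only wire it up when actually running inside an extension.
if (typeof chrome !== "undefined" && chrome.runtime) {
  createRelay(chrome);
}
```

The optional step 5 would be an ordinary request from the background page to your server at the point where the data is forwarded.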

The terms I use for the browser extension ("background page" and "content script") are pretty much focussed on Google Chrome. The same concepts are available in Safari and Firefox as well. If you want to support IE, you're going to have to work out something else. IE10 does not even plan to support extensions.

Upvotes: 1

Amy

Reputation: 7466

If the external page and your page are on the same domain, then you should be able to access that external page using JavaScript. Otherwise, the JavaScript won't be allowed to read the external site: browsers enforce the same-origin policy and block this kind of cross-site access.

Upvotes: 1
