Reputation: 3366
The basic idea is that the web application fetches an external website and overlays some JavaScript on it for additional functionality.
However, the links on the fetched page shouldn't navigate to the external website, but should stay on my website. I figured that converting the links with regular expressions (or a similar method) would be insufficient, as it would not cover dynamically generated links, such as those created by AJAX requests or other JavaScript functionality. So basically, what I can't seem to find is a method to change/intercept/redirect all links of the scraped website.
So, what is a (good) way to change/intercept the dynamically generated links of a scraped website? Preferably a Python-based method.
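For context, here is roughly what I mean by static rewriting - a minimal sketch using requests and BeautifulSoup, where "/proxy?url=..." is just a placeholder route on my own site. It only catches links that are already in the HTML, which is exactly the limitation I'm asking about:

```python
# Rough sketch of static link rewriting (requests + BeautifulSoup).
# "/proxy?url=..." is a placeholder route on my own site; the real endpoint
# would fetch and re-serve the target page. This only catches links present
# in the HTML, not ones generated later by JavaScript.
from urllib.parse import urljoin, quote

import requests
from bs4 import BeautifulSoup

def rewrite_static_links(page_url):
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])          # resolve relative links
        a["href"] = "/proxy?url=" + quote(absolute, "")  # route through my site

    return str(soup)
```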
Upvotes: 0
Views: 167
Reputation: 29710
Unless you're changing the URLs on the scraped web page (including the dynamic ones), you can't do what you ask.
If a client is served a web page containing a URL that points to an external site, your website has no opportunity to intercept or change it, since their browser will navigate away without ever contacting your site again (not strictly true, though - read on). Theoretically, you could attach event handlers to all links (before serving up the scraped page), and even intercept dynamically created ones (by parsing their JavaScript), but this would prove pretty difficult. You would also have to block every other way the URL can change (such as header redirection).
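To make the "attach event handlers" idea concrete, here is a minimal sketch (Python + BeautifulSoup, with the same placeholder "/proxy?url=..." route as an assumption) that injects a delegated click listener into the scraped HTML before serving it. Note how little it actually covers: it intercepts clicks on anchors, including ones added later by scripts, but does nothing about XHR/fetch calls, window.location assignments, form submissions, or redirects.

```python
# Crude sketch of the "attach event handlers" idea: inject a delegated click
# listener into the scraped page before serving it. It catches clicks on <a>
# elements (even ones created later by scripts), but ignores XHR/fetch,
# window.location assignments, form posts, and meta/header redirects.
from bs4 import BeautifulSoup

INTERCEPT_JS = """
document.addEventListener('click', function (e) {
    var a = e.target.closest('a[href]');
    if (!a) return;
    e.preventDefault();
    // '/proxy?url=' is a placeholder route on the proxying site
    window.location = '/proxy?url=' + encodeURIComponent(a.href);
}, true);
"""

def inject_click_interceptor(html):
    soup = BeautifulSoup(html, "html.parser")
    script = soup.new_tag("script")
    script.string = INTERCEPT_JS
    (soup.body or soup).append(script)   # fall back to root if there is no <body>
    return str(soup)
```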
Clients themselves can use proxies in their browsers (which affect all outgoing URLs), but that is the client deciding that all of their traffic should be routed through a proxy server. You can't do this on their behalf (without actually changing the URLs).
EDIT: Since the OP removed the suggestion of using a web proxy, the details of this answer change slightly, but the end result is the same: for all practical purposes, it's nearly impossible to do this.
You could try parsing the JavaScript on the page and be successful for some pages (or, with a sophisticated enough script, possibly for many typical pages); but throw in one little eval on the page, and you'll need your own JavaScript engine written in JavaScript to try to figure out every possible external request a page could make... and even then you couldn't do it.
Basically: give me a script that someone claims can parse any web page (including its JavaScript) and intercept every external call, and I'll give you a web page that the script won't work for. Disclaimer: I'm talking about intercepting the links while letting the site otherwise function normally - not just parsing the page and stripping out all JavaScript entirely.
Someone else may be able to provide you with an answer that works some of the time on some web pages - maybe that would be good enough for your purposes.
Also, have you considered that most JavaScript on a page isn't embedded, but is instead loaded via <script> tags, or possibly even loaded dynamically, from the original server? I assume you'd want to distinguish "stuff loaded from the original server needed to make the page function and look correct" from "stuff loaded from the original server for other purposes". How does your program "know" the difference?
You could try parsing the page and removing all JavaScript... but even this would be very difficult, since there are still tricky ways of getting around it.
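For completeness, a rough sketch of what "removing all JavaScript" could look like (BeautifulSoup again, same assumptions as before). Even this is incomplete, which is the point - it says nothing about content loaded later, srcdoc iframes, meta refreshes, and so on:

```python
# Sketch of the "strip all JavaScript" approach: remove <script> blocks,
# on* handler attributes and javascript: hrefs. Still incomplete - it misses
# javascript: URLs in other attributes, srcdoc iframes, meta refreshes, and
# anything loaded after the fact.
from bs4 import BeautifulSoup

def strip_scripts(html):
    soup = BeautifulSoup(html, "html.parser")

    for script in soup.find_all("script"):
        script.decompose()                      # drop <script> blocks entirely

    for tag in soup.find_all(True):
        for attr in list(tag.attrs):
            if attr.lower().startswith("on"):   # onclick, onload, ...
                del tag[attr]
        href = tag.get("href", "")
        if href.strip().lower().startswith("javascript:"):
            tag["href"] = "#"

    return str(soup)
```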
Upvotes: 2