Reputation: 385
I am working on integrating the PhantomJS headless browser into a project of mine (currently using version 1.6). For the most part, it does a great job of what I need it to do. However, the asynchronous nature of WebPage.open() calls, and the need to call phantom.exit() at some point, make it tricky to handle client-side redirects when you can't anticipate where they're going to go.
What I'm after is a way to call phantom.exit() only after any meta refreshes (that lead to a different page) and JavaScript redirects tied to things like onload events have been executed. I can see why this is an issue, because in theory a client side redirect could take place any number of seconds after a page load, and I can't simply ask for the ability to exit only when no more redirects are going to take place. Right now, the best solution I can think of is to a) manually detect the presence of meta refresh elements on the page and deal with those myself, and b) use setInterval() to allow some sane amount of time (say, 1-1.5 seconds) to elapse before calling phantom.exit(). It would basically look like this:
var page = require('webpage').create();
var visitComplete = false;
var url = "http://some.url";
var pageOpenedTime;

// Poll once a second; exit once the visit is complete and
// at least 1.5 seconds have passed since the page opened.
setInterval(function() {
    if (visitComplete && typeof pageOpenedTime != 'undefined' &&
        new Date() - pageOpenedTime >= 1500)
    {
        phantom.exit();
    }
}, 1000);

page.open(url, function() {
    pageOpenedTime = new Date();
    if (!hasMetaRefresh(page)) {
        visitComplete = true;
    }
});

function hasMetaRefresh(page) {
    // Query the DOM here to detect meta refresh elements
}
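For the hasMetaRefresh() stub, here is a rough sketch of how the detection could work. The names parseMetaRefresh, leadsElsewhere and readMetaRefreshContent are made up for illustration, and the parsing only covers the common "delay; url=target" forms. It also compares URLs as plain strings, so relative targets are not resolved. Only page.evaluate() is an actual PhantomJS call here.

```javascript
// Hypothetical helper: parse the content attribute of a meta refresh tag,
// e.g. "5", "5; url=http://example.com/" or "0;URL='http://example.com/'".
// Returns null when the value does not look like a refresh directive.
function parseMetaRefresh(content) {
    var pattern = /^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?["']?(.*?)["']?\s*)?$/i;
    var match = pattern.exec(content || '');
    if (!match) {
        return null;
    }
    return {
        delaySeconds: parseInt(match[1], 10),
        url: match[2] ? match[2] : null
    };
}

// Hypothetical helper: true only when the refresh names a target that
// differs from the current URL (plain string comparison, no resolution).
function leadsElsewhere(content, currentUrl) {
    var parsed = parseMetaRefresh(content);
    return parsed !== null && parsed.url !== null && parsed.url !== currentUrl;
}

// Assumed PhantomJS wiring: read the first meta refresh content value
// from the loaded page, or null if there is none.
function readMetaRefreshContent(page) {
    return page.evaluate(function () {
        var metas = document.getElementsByTagName('meta');
        for (var i = 0; i < metas.length; i++) {
            var equiv = metas[i].getAttribute('http-equiv') || '';
            if (equiv.toLowerCase() === 'refresh') {
                return metas[i].getAttribute('content');
            }
        }
        return null;
    });
}
```

With these, hasMetaRefresh(page) could return leadsElsewhere(readMetaRefreshContent(page), url).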
Any better ideas?
Edit: I should mention that my first thought was that there might be a PhantomJS event that gets fired when the JavaScript associated with the initial page load has been executed, but the onLoadFinished callback appears to precede the execution of any in-page JavaScript, including onload events. I also did some testing about how much of an interval I might need to wait, and while 1000 ms was long enough for a JavaScript redirect (via body onload event) to get executed in a small test page, 100 ms was not long enough.
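The waiting logic described above can also be framed as a "quiet period": exit only once no load has been in progress for some time, and restart the clock whenever a new load begins. The sketch below is a guess at how that might look, not a tested solution. createSettleTracker is a hypothetical helper, the 1500 ms figure is the one from the question, and it assumes that every client-side redirect triggers onLoadStarted and onLoadFinished, which I have not verified on 1.6. The PhantomJS part only runs when the phantom global exists.

```javascript
// Hypothetical helper: remembers whether a load is in progress and when
// the last one finished. Times are passed in so the logic is easy to test.
function createSettleTracker(quietMs) {
    var loading = false;
    var lastFinished = null;
    return {
        loadStarted: function () {
            loading = true;
        },
        loadFinished: function (now) {
            loading = false;
            lastFinished = now;
        },
        // Settled: at least one load finished, nothing is loading now,
        // and quietMs have elapsed since the last load finished.
        isSettled: function (now) {
            return !loading && lastFinished !== null &&
                now - lastFinished >= quietMs;
        }
    };
}

// Assumed PhantomJS wiring; skipped when run outside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var tracker = createSettleTracker(1500);

    page.onLoadStarted = function () {
        tracker.loadStarted();
    };
    page.onLoadFinished = function () {
        tracker.loadFinished(Date.now());
    };

    // No callback is passed to open(), so the handlers above stay in place.
    page.open('http://some.url');

    setInterval(function () {
        if (tracker.isSettled(Date.now())) {
            phantom.exit();
        }
    }, 250);
}
```

This still cannot catch a redirect scheduled later than the quiet period, so it only tightens the timeout approach rather than replacing it.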
Upvotes: 21
Views: 7103
Reputation: 13166
I have already looked through various examples of PhantomJS redirect handling: tough luck.
For the time being, there is no universal fix for it. If you patch a script as suggested here, it will fail in other scenarios, e.g. when the page redirects with JavaScript by some means other than location.href. I haven't tested the body yet. After some monkey patching here and there, I gave up.
I just use the "heavy" Selenium-driven Firefox to solve my issues. If you need to load many pages, instead of restarting Firefox, just use webdriver.delete_all_cookies() to clean up some of the cache. It gives me reliable results (I need to capture screenshots, download the HTML, get the final URL, and more) compared to PhantomJS.
Upvotes: 0
Reputation: 101
I've had the same issue loading a page that was using Optimizely, where the variation was a location.href redirect.
I now use the onNavigationRequested callback inside a "renderPage" function. Those Optimizely redirects no longer block and I don't need an arbitrary timeout.
var webpage = require('webpage');
var page = null;

var renderPage = function (myurl) {
    page = webpage.create();

    page.onNavigationRequested = function(url, type, willNavigate, main) {
        if (main && url != myurl && url.replace(/\/$/, "") != myurl &&
            (type == "Other" || type == "Undefined")) {
            // main = navigation in main frame; type = not by click/submit etc
            console.log("\tfollowing " + myurl + " redirect to " + url);
            myurl = url;
            page.close();
            renderPage(url); // rerun this function with the new URL
        }
    }; // on Nav req

    page.open(myurl, function(status) {
        if (status === "success") {
            page.render("screenshot.jpg");
        } else {
            page.close();
        }
    }); // page open
}; // render page

renderPage("http://some.domain.com");
see docs: http://phantomjs.org/api/webpage/handler/on-navigation-requested.html
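One thing the snippet above does not guard against is a redirect loop, since renderPage calls itself for every redirect. Below is a small sketch of pulling the handler's condition into a function and adding a cap. Both isAutomaticRedirect and createRedirectBudget are hypothetical names, and unlike the original condition this version strips a trailing slash from both URLs before comparing them.

```javascript
// Strip a single trailing slash so "http://a.com/" equals "http://a.com".
function stripTrailingSlash(url) {
    return url.replace(/\/$/, '');
}

// Hypothetical helper mirroring the condition in onNavigationRequested:
// main-frame navigation to a different URL that was not caused by a
// click or form submit (navigation type "Other" or "Undefined").
function isAutomaticRedirect(currentUrl, requestedUrl, type, isMainFrame) {
    var sameUrl = stripTrailingSlash(requestedUrl) ===
        stripTrailingSlash(currentUrl);
    var automatic = type === 'Other' || type === 'Undefined';
    return Boolean(isMainFrame) && !sameUrl && automatic;
}

// Hypothetical helper: allows at most maxRedirects follow-ups.
function createRedirectBudget(maxRedirects) {
    var used = 0;
    return {
        take: function () {
            if (used >= maxRedirects) {
                return false;
            }
            used += 1;
            return true;
        }
    };
}
```

Inside the handler, the check would become something like isAutomaticRedirect(myurl, url, type, main) && budget.take(), with budget created once outside renderPage.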
Upvotes: 8
Reputation: 3917
One idea is to use mocked timers for this. Suppose we include a mocked timer in the page; that way, you can fast-forward time and skip the JavaScript idle time. See the examples on the GitHub page.
This only makes things happen faster. As you would expect, it cannot guarantee that no redirect will fire in the future.
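To make the fast-forward idea concrete, here is a minimal hand-rolled fake timer. It is not the library referred to above, and createFakeTimers is a made-up name. It only covers setTimeout and clearTimeout, and getting it into the page before the page's own scripts run (for instance with page.injectJs from an onInitialized handler) is assumed and not shown.

```javascript
// Hypothetical fake timer: callbacks are queued instead of scheduled, and
// tick(ms) advances a virtual clock, running whatever falls due.
function createFakeTimers() {
    var now = 0;
    var nextId = 1;
    var pending = [];

    function fakeSetTimeout(fn, delay) {
        var id = nextId++;
        pending.push({ id: id, at: now + (delay || 0), fn: fn });
        return id;
    }

    function fakeClearTimeout(id) {
        pending = pending.filter(function (timer) {
            return timer.id !== id;
        });
    }

    function tick(ms) {
        var end = now + ms;
        for (;;) {
            // Pick the earliest due timer; ties go to the one created first.
            var due = pending.filter(function (timer) {
                return timer.at <= end;
            }).sort(function (a, b) {
                return a.at - b.at || a.id - b.id;
            })[0];
            if (!due) {
                break;
            }
            fakeClearTimeout(due.id);
            now = due.at;
            due.fn(); // may queue further timers, which the loop picks up
        }
        now = end;
    }

    return {
        setTimeout: fakeSetTimeout,
        clearTimeout: fakeClearTimeout,
        tick: tick
    };
}
```

A redirect that the page schedules five seconds out would then run as soon as tick(5000) is called, instead of after five real seconds.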
Upvotes: 0