Jared Carter

Reputation: 21

HTML output from PhantomJS differs from Google Chrome/Firefox

I've been debugging this for a long time and it has me completely baffled. I need to save ads to my computer for a work project. Here is an example ad that I got from CNN.com:

http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=

When I visit this link in Google Chrome or Firefox, I see an ad (if the link stops working, simply go to CNN.com and grab the iframe URL for one of the ads). I wrote a PhantomJS script that saves a screenshot and the HTML of any page. It works on ordinary websites, but not on these ads: the screenshot is blank and the HTML contains a tracking pixel (a 1x1 transparent GIF used to track the ad). I expected it to give me what I see in my normal browser.

The only thing that I can think of is that the AJAX calls are somehow messing up PhantomJS, so I hard-coded a delay but I got the same results.
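One way to test the AJAX theory is to log every request PhantomJS makes and count how many are still in flight when the delay expires. The following is only a rough sketch, not part of the original script: `createRequestTracker` is a made-up helper name, the URL is assumed to be the first command-line argument, and the `typeof phantom` guard is there only so the helper can also be loaded outside PhantomJS.

```javascript
// Tracks in-flight requests by id so the script can tell when the network is quiet.
function createRequestTracker() {
    var pending = {};
    return {
        requested: function (id) { pending[id] = true; },
        received: function (id, stage) {
            // PhantomJS reports a response in stages; only 'end' means it is complete.
            if (stage === 'end') { delete pending[id]; }
        },
        failed: function (id) { delete pending[id]; },
        pendingCount: function () { return Object.keys(pending).length; }
    };
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var tracker = createRequestTracker();
    var url = require('system').args[1]; // assumption: URL is the first CLI argument

    page.onResourceRequested = function (requestData) {
        console.log('REQ  ' + requestData.url);
        tracker.requested(requestData.id);
    };
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            console.log('RES  ' + response.status + ' ' + response.url);
        }
        tracker.received(response.id, response.stage);
    };
    page.onResourceError = function (resourceError) {
        console.log('ERR  ' + resourceError.errorString + ' ' + resourceError.url);
        tracker.failed(resourceError.id);
    };
    // Surface console output from the ad's own scripts.
    page.onConsoleMessage = function (msg) {
        console.log('PAGE ' + msg);
    };

    page.open(url, function (status) {
        window.setTimeout(function () {
            console.log('status=' + status + ', still pending=' + tracker.pendingCount());
            phantom.exit();
        }, 9000);
    });
}
```

If the log shows the same requests as the browser's network panel and nothing is pending after the delay, timing is probably not the cause and the difference lies elsewhere (for example in cookies or request headers).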

Here is the most basic piece of test code that reproduces my problem:

var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    }
    else {
        // Output Results Immediately
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlBeforeTimeout.htm", html, 'w');
        page.render('RenderBeforeTimeout.png');

        // Output Results After Delay (for AJAX)
        window.setTimeout(function () {
            var html = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            fs.write("HtmlAfterTimeout.htm", html, 'w');
            page.render('RenderAfterTimeout.png');
            phantom.exit();
        }, 9000); // 9 Second Delay 
    }
});

You can run this code using this command in your terminal:

phantomjs getHtml.js 'http://www.google.com/'

The above command works well. When you replace the Google URL with an ad URL (like the one at the top of this post), it gives me the unexpected results that I explained.

Thanks so much for your help! This is my first question that I've ever posted on here, because I can almost always find the answer by searching Stack Overflow. This one, however, has me completely stumped! :)

EDIT: I'm running PhantomJS 1.9.7 on Ubuntu 14.04 (Trusty Tahr)

EDIT: Okay, I've been working on it for a while now and I think it has something to do with cookies. If I clear all of my history and view the link in my browser, it also comes up blank. If I then refresh the page, it displays fine. It also displays fine if I open it in a new tab. The only time it doesn't is when I try to view it directly after clearing my cookies.

EDIT: I've tried loading the link twice in PhantomJS without exiting (manually requesting it twice in my script before calling phantom.exit()). It doesn't work. In the PhantomJS documentation it says that the cookie jar is enabled by default. Any ideas? :)
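To check the cookie theory directly, it may help to print what is actually in the cookie jar between the two loads. This is a minimal sketch under a few assumptions: `describeCookies` and the output name `SecondLoad.png` are made up for illustration, and the URL is taken from the first command-line argument. `phantom.cookies` is the jar shared by all pages in the process.

```javascript
// Turns a cookie array (the shape PhantomJS exposes) into "name=value @ domain" lines.
function describeCookies(cookies) {
    return cookies.map(function (c) {
        return c.name + '=' + c.value + ' @ ' + c.domain;
    });
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1]; // assumption: URL is the first CLI argument

    page.open(url, function (firstStatus) {
        console.log('first load: ' + firstStatus);
        console.log(describeCookies(phantom.cookies).join('\n') || '(no cookies set)');

        // The second request reuses the same cookie jar as the first.
        page.open(url, function (secondStatus) {
            console.log('second load: ' + secondStatus);
            page.render('SecondLoad.png');
            phantom.exit();
        });
    });
}
```

If the first load prints "(no cookies set)", the ad server's cookie never reached the jar, which would explain why the second load behaves the same as the first.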

Upvotes: 2

Views: 1971

Answers (1)

Cameron Tinker

Reputation: 9789

You should try using the onLoadFinished callback instead of checking for status in page.open. Something like this should work:

var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

// Register the handler before starting the load.
page.onLoadFinished = function () {
    // Output Results Immediately
    var html = page.evaluate(function () {
        return document.getElementsByTagName('html')[0].innerHTML;
    });
    fs.write("HtmlBeforeTimeout.htm", html, 'w');
    page.render('RenderBeforeTimeout.png');

    // Output Results After Delay (for AJAX)
    window.setTimeout(function () {
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlAfterTimeout.htm", html, 'w');
        page.render('RenderAfterTimeout.png');
        phantom.exit();
    }, 9000); // 9 Second Delay
};

page.open(url);
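If the fixed 9-second delay turns out to be too short or needlessly long, polling for a condition is one alternative. The sketch below is not part of the script above: `waitFor`, the readiness check (any image, iframe, or plugin element larger than 1x1), and the file name `RenderAfterWait.png` are all assumptions chosen for illustration.

```javascript
// Polls `condition` every `intervalMs`; calls onDone(true) as soon as it is truthy,
// or onDone(false) once `timeoutMs` has elapsed without success.
function waitFor(condition, onDone, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var ok = condition();
        if (ok || Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onDone(Boolean(ok));
        }
    }, intervalMs || 250);
}

// The wiring below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1]; // assumption: URL is the first CLI argument

    page.open(url, function (status) {
        waitFor(function () {
            // Hypothetical readiness check: something bigger than a tracking pixel exists.
            return page.evaluate(function () {
                var nodes = document.querySelectorAll('img, iframe, object, embed');
                for (var i = 0; i < nodes.length; i++) {
                    if (nodes[i].offsetWidth > 1 && nodes[i].offsetHeight > 1) {
                        return true;
                    }
                }
                return false;
            });
        }, function (ready) {
            console.log(ready ? 'content appeared' : 'timed out waiting for content');
            page.render('RenderAfterWait.png');
            phantom.exit();
        }, 9000, 250);
    });
}
```

The script renders as soon as the check passes and still gives up after 9 seconds, so a page that never fills in costs no more time than the fixed delay.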

I have an answer here that loops through all files in a local folder and saves images of the resulting pages: "Using Phantom JS to convert all HTML files in a folder to PNG". The same principle applies to remote HTML pages.

Here is what I have from the output:
Before Timeout:
https://i.sstatic.net/GmsH9.jpg

After Timeout:
https://i.sstatic.net/mo6Ax.jpg

Upvotes: 1
