Kode Charlie

Reputation: 1489

Retrieve fully populated dynamic content with PhantomJS

I downloaded pjscrape (running PhantomJS under the hood), and in fact, page queries returned fully populated content, including dynamic content. Unfortunately, pjscrape only emits JSON or CSV. I need HTML.

Using PhantomJS alone, I have this script (call it my-query.js):

var page = require('webpage').create();
page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
    console.log("status: " + status);
    if (status !== "success") {
      console.log("Unable to access network");
    } else {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js", function() {
          console.log("Got jQuery...");
          var fullyPopulatedContent = null;
          page.evaluate(function() {
              $(document).ready(function() {
                  fullyPopulatedContent = $("html").html();
                });
          });
          window.setTimeout(function() {
              console.log(fullyPopulatedContent);
            }, 10000);
      });
    }
  });

But this logic never sets fullyPopulatedContent after page.evaluate is done; i.e., fullyPopulatedContent is always null.

This seems like such a trivial application that you would think PhantomJS would do it out of the box for free.

Any clues on how to make such queries work when the target URL contains content that is populated dynamically via Ajax/JavaScript or frames? And if frames are involved, can you also please explain how PhantomJS navigates through frame content? The online documentation and examples are not clear on that topic.

Upvotes: 1

Views: 4667

Answers (1)

Artjom B.

Reputation: 61952

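PhantomJS has two contexts: the outer script context and the DOM/page context. page.evaluate() provides access to the DOM/page context. The function you pass to it is sandboxed, so it cannot see variables from the outer script such as fullyPopulatedContent. That is why you need to explicitly pass data in (as arguments) and out (as the return value).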
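Here is a minimal sketch of passing data across that boundary. The helper name getOuterHtml and its selector parameter are made up for illustration; the only PhantomJS behaviour it relies on is that page.evaluate() forwards extra arguments to the function and hands back its return value, both of which have to be JSON-serializable.

```javascript
// Hypothetical helper: returns the serialized markup of the first element
// matching `selector`, or null if nothing matches.
function getOuterHtml(page, selector) {
    // The inner function runs in the page context. It cannot see variables
    // from this script, so `selector` goes in as an argument and the result
    // comes back as the return value.
    return page.evaluate(function (sel) {
        var el = document.querySelector(sel);
        return el ? el.outerHTML : null;
    }, selector);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
        if (status === 'success') {
            console.log(getOuterHtml(page, 'html'));
        }
        phantom.exit();
    });
}
```

Note that this only fixes the scoping problem. It still reads the DOM as soon as the load callback fires, so dynamically added content may be missing unless you wait first.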

Another problem is that the event $(document).ready() listens for was probably triggered long before you register the handler inside page.evaluate(). If that is the only reason you wanted to load jQuery, then you shouldn't do it.

You could simply wait a static amount of time:

var page = require('webpage').create();
page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
    console.log("status: " + status);
    if (status !== "success") {
      console.log("Unable to access network");
      phantom.exit(1); // otherwise the script never terminates on failure
    } else {
        window.setTimeout(function() {
            console.log(page.content);
            phantom.exit();
        }, 10000); // adjust time for every page
    }
});

The problem, of course, is that you cannot easily determine whether the page is fully loaded. A generally good approach is to waitFor (a function from the PhantomJS examples) a specific condition, for example that a final element has appeared or that at least x elements of the same type are present in the page. This is usually done with CSS selectors using document.querySelector() or document.querySelectorAll() through page.evaluate().
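A rough sketch of that polling approach follows. This waitFor is a simplified variant written for this answer, not a verbatim copy of the one in the PhantomJS examples, and the '.event' selector and the threshold of 5 are placeholders you would replace with something that matches the target page.

```javascript
// Polls `testFx` every `intervalMs` milliseconds. Calls `onReady` as soon as
// it returns true, or `onTimeout` once `timeoutMs` has elapsed without success.
function waitFor(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Placeholder condition: at least 5 elements with class "event".
            return page.evaluate(function () {
                return document.querySelectorAll('.event').length >= 5;
            });
        }, function () {
            console.log(page.content);
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for content');
            phantom.exit(1);
        }, 10000, 250);
    });
}
```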

Another way would be to count requested and finished resources, treat the page as loaded once there have been no pending requests for a short period, and hope that the quiet period you picked is longer than the gaps between the page's resource requests.
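The resource-counting idea could look roughly like this. The helper onNetworkIdle and the 2000 ms quiet period are assumptions for illustration. It uses the page.onResourceRequested, page.onResourceReceived, page.onResourceError and page.onResourceTimeout callbacks and tracks requests by their id, so a resource reported as both failed and ended is only counted once.

```javascript
// Hypothetical helper: calls `done` once, after no request has been pending
// for `quietMs` milliseconds. Install it before calling page.open().
function onNetworkIdle(page, quietMs, done) {
    var pending = {};   // request id -> true while the request is in flight
    var timer = null;
    var fired = false;

    function settle(id) {
        delete pending[id];
        clearTimeout(timer);
        if (Object.keys(pending).length === 0) {
            // Nothing in flight: start (or restart) the quiet-period countdown.
            timer = setTimeout(function () {
                if (!fired) {
                    fired = true;
                    done();
                }
            }, quietMs);
        }
    }

    page.onResourceRequested = function (requestData) {
        pending[requestData.id] = true;
        clearTimeout(timer); // new activity cancels the countdown
    };
    page.onResourceReceived = function (response) {
        // A resource is reported in stages; only 'end' means it is finished.
        if (response.stage === 'end') {
            settle(response.id);
        }
    };
    page.onResourceError = function (resourceError) { settle(resourceError.id); };
    page.onResourceTimeout = function (request) { settle(request.id); };
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    onNetworkIdle(page, 2000, function () {
        console.log(page.content);
        phantom.exit();
    });
    page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
        }
    });
}
```

A page that keeps polling its server may never go quiet, so in practice you would probably combine this with an overall timeout.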

Frames:

PhantomJS automatically fetches (i)frames as part of the page load. However, they may finish loading later than the main/parent frame, which is why you might need an additional waiting period.

When you take a screenshot with page.render() you will see the complete page including the loaded (or currently loading) frames.

Since frames are separate documents which have their own document root, PhantomJS doesn't include them when you try to print the page source of the main/parent page with page.content. You first need to change into their context in order to print their DOM representation.

You can either do that by name (if the frame has a name) or by index (depending on the number of frames in the current (parent) frame). Use page.switchToFrame() for that. Then you can retrieve the frame content with page.frameContent. Since you switched into the frame context, now you can do all interaction that you could previously do in the main frame like freely changing the DOM or clicking on stuff. When you're done with the frame, then you can change back with page.switchToParentFrame().
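As a sketch of the frame handling described above, the following collects the markup of each direct child frame by index. The helper name collectFrameContents is made up, and the 10 second wait is the same arbitrary delay as in the static-timeout example. It uses page.framesCount, page.switchToFrame(), page.frameContent and page.switchToParentFrame(), and it does not descend into nested frames.

```javascript
// Hypothetical helper: returns an array with the markup of every direct
// child frame of the frame the page is currently switched into.
function collectFrameContents(page) {
    var contents = [];
    var count = page.framesCount; // read once, while still in the parent frame
    for (var i = 0; i < count; i += 1) {
        page.switchToFrame(i);            // descend into child frame i
        contents.push(page.frameContent); // markup of the frame we are now in
        page.switchToParentFrame();       // go back up before the next one
    }
    return contents;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.sonoma.edu/calendar/groups/clubs.html', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
            phantom.exit(1);
            return;
        }
        window.setTimeout(function () {
            console.log(page.content); // main document only
            collectFrameContents(page).forEach(function (html, index) {
                console.log('--- frame ' + index + ' ---');
                console.log(html);
            });
            phantom.exit();
        }, 10000); // adjust time for every page
    });
}
```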

Upvotes: 1
