Reputation: 63619
I am trying to use PhantomJS to load a page (one that uses JavaScript to load items on the webpage) and return all the HTML on the page (at least within the <body /> tags) to the PHP function that executes phantomjs httpget.js.
Problem: I can get phantomjs to return the document.title, but asking it to console.log(document.body) simply gives me [object Object]. How can I extract the page's HTML?
It also takes much longer to load the webpage using phantomjs compared to the browser.
httpget.js
console.log('hello!');
var page = require('webpage').create();
page.open("http://www.asos.com/Men/T-Shirts-Vests/Cat/pgecategory.aspx?cid=7616#parentID=-1&pge=0&pgeSize=900&sort=1",
    function (status) {
        console.log('Page title is ' + page.evaluate(function () {
            return document.body;
        }));
        phantom.exit();
    });
Output (running from shell)
hello!
Page title is [object Object]
Upvotes: 2
Views: 2525
Reputation: 12561
Read the documentation: page.content gets you the entire HTML.
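A minimal sketch of that approach, assuming the URL is passed as the first command-line argument. The helper name fetchHtml and the file name dump_html.js are made up for illustration; page.open, page.content, the system module, and phantom.exit are the PhantomJS pieces being relied on.

```javascript
// Sketch for PhantomJS; run with: phantomjs dump_html.js <url>
// fetchHtml is a hypothetical helper name, not part of PhantomJS.
function fetchHtml(page, url, callback) {
    page.open(url, function (status) {
        if (status !== 'success') {
            callback(new Error('Failed to load ' + url), null);
            return;
        }
        // page.content holds the current document serialized as an HTML string.
        callback(null, page.content);
    });
}

// Only run when executed under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    // system.args[0] is the script name, so the URL is system.args[1].
    fetchHtml(page, system.args[1], function (err, html) {
        console.log(err ? err.message : html);
        phantom.exit(err ? 1 : 0);
    });
}
```

Because the script prints the HTML to stdout, the calling PHP function can read it from the command's output.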
Upvotes: 0
Reputation: 2852
Not sure what this has to do with Node.js as you appear to be using PhantomJS directly, not node (or phantom via node-phantom)...
But to answer your question, you need to do this:
var html = page.evaluate(function () {
    var root = document.getElementsByTagName("html")[0];
    var html = root ? root.outerHTML : document.body.innerHTML;
    return html;
});
This works with pages that don't have an outer <html> tag.
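The original attempt printed [object Object] because page.evaluate can only hand back values that serialize like JSON, and a DOM node such as document.body does not; returning a string avoids that. Since the page in the question loads its items with JavaScript, the markup may also be incomplete when the open callback first fires. The following is a rough sketch that waits before reading the HTML. The helper name grabHtmlAfterDelay and the 2000 ms delay are assumptions chosen for illustration, not values from the question, and a fixed delay is only a crude stand-in for knowing when the page has finished loading.

```javascript
// grabHtmlAfterDelay is a hypothetical helper; the delay is an arbitrary guess.
function grabHtmlAfterDelay(page, delayMs, callback) {
    setTimeout(function () {
        var html = page.evaluate(function () {
            // Return a string: DOM nodes do not survive the trip out of evaluate().
            return document.documentElement.outerHTML;
        });
        callback(html);
    }, delayMs);
}

// Only run when executed under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open(require('system').args[1], function (status) {
        if (status !== 'success') {
            console.log('Failed to load page');
            phantom.exit(1);
            return;
        }
        // Give the page's own scripts some time to add their content.
        grabHtmlAfterDelay(page, 2000, function (html) {
            console.log(html);
            phantom.exit();
        });
    });
}
```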
Upvotes: 2