Reputation: 47
I'm trying to get a list of all image src URLs in a given webpage using PhantomJS. My understanding is that this should be extremely easy, but for whatever reason, I can't seem to make it work. Here is the code I currently have:
var page = require('webpage').create();
page.open('http://www.walmart.com');
page.onLoadFinished = function(){
var images = page.evaluate(function(){
return document.getElementsByTagName("img");
});
for(thing in a){
console.log(thing.src);
}
phantom.exit();
}
I've also tried this:
var a = page.evaluate(function(){
returnStuff = new Array;
for(stuff in document.images){
returnStuff.push(stuff);
}
return returnStuff;
});
And this:
var page = require('webpage').create();
page.open('http://www.walmart.com', function(status){
var images = page.evaluate(function() {
return document.images;
});
for(image in images){
console.log(image.src);
}
phantom.exit();
});
I've also tried iterating through the images in the evaluate function and getting the .src property that way.
None of them return anything meaningful. If I return the length of document.images, there are 54 images on the page, but trying to iterate through them provides nothing useful.
Also, I've looked at the following other questions and wasn't able to use the information they provided: How to scrape javascript injected image src and alt with phantom.js and How to download images from a site with phantomjs
Again, I just want the source url. I don't need the actual file itself. Thanks for any help.
UPDATE
I tried using
var a = page.evaluate(function(){
returnStuff = new Array;
for(stuff in document.images){
returnStuff.push(stuff.getAttribute('src'));
}
return returnStuff;
});
It threw an error saying that stuff.getAttribute('src') returns undefined. Any idea why that would be?
Upvotes: 2
Views: 3559
Reputation: 21
I used the following code to get all the images loaded on the page. The images loaded in the browser changed dimensions based on the viewport; since I wanted the maximum dimensions, I used the largest viewport to get the actual image sizes.
Get all image URLs on a page using PhantomJS
The code below retrieves the URL even when the image is not in an img tag.
Even images referenced from stylesheet rules such as these will be retrieved:
@media screen and (max-width:642px) {
.masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
background-image: url(assets/images/bg_studentcc-750x879-sm.jpg);
}
}
@media screen and (min-width:643px) {
.masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
background-image: url(assets/images/bg_studentcc-1920x490.jpg);
}
}
var page = require('webpage').create();
var url = "https://......";
page.settings.clearMemoryCaches = true;
page.clearMemoryCache();
page.viewportSize = {width: 1280, height: 1024};
// Register the handler before opening the page so no response is missed.
page.onResourceReceived = function(response) {
    // Only look at the "start" stage so each resource is logged once.
    if(response.stage == "start"){
        // contentType may be missing, so fall back to an empty string.
        var respType = response.contentType || "";
        if(respType.indexOf("image")==0){
            console.log('Content-Type : ' + response.contentType);
            console.log('Status : ' + response.status);
            console.log('Image Size in byte : ' + response.bodySize);
            console.log('Image Url : ' + response.url);
            console.log('\n');
        }
    }
};
page.open(url, function (status) {
    if(status=='success'){
        console.log('The entire page is loaded.............################');
    }
    // Exit once loading finishes; otherwise the script keeps running.
    phantom.exit();
});
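The content-type test can be pulled into a small helper that also tolerates a missing contentType, which may occur on responses that carry no Content-Type header. This is only a sketch: isImageResponse is a hypothetical name, and the sample objects are hand-written stand-ins for the response objects PhantomJS passes to onResourceReceived.

```javascript
// Hypothetical helper: true when a response reports an image content type.
function isImageResponse(response) {
    var type = response && response.contentType;
    return typeof type === 'string' && type.toLowerCase().indexOf('image/') === 0;
}

// Hand-written samples with only the fields the check needs (an assumption,
// not real PhantomJS output).
var samples = [
    { url: 'http://example.com/logo.png', contentType: 'image/png' },
    { url: 'http://example.com/', contentType: 'text/html; charset=utf-8' },
    { url: 'http://example.com/redirect', contentType: null }
];

// Keep only the image responses and collect their URLs.
var imageUrls = samples.filter(isImageResponse).map(function (r) {
    return r.url;
});

console.log(imageUrls);
```

Inside the handler, the same helper would replace the indexOf test on response.contentType.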
Upvotes: 0
Reputation: 16838
@MayorMonty was almost there. Indeed, you cannot return an HTMLCollection from page.evaluate.
As the docs say:
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
Thus the working script is like this:
var page = require('webpage').create();
page.onLoadFinished = function(){
    var urls = page.evaluate(function(){
        // Collect plain strings, which can be returned from evaluate.
        var image_urls = [];
        var images = document.getElementsByTagName("img");
        // Declare q with var so it does not leak into the page's global scope.
        for(var q = 0; q < images.length; q++){
            image_urls.push(images[q].src);
        }
        return image_urls;
    });
    console.log(urls.length);
    console.log(urls[0]);
    phantom.exit();
};
page.open('http://www.walmart.com');
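The serialization rule from the docs can be illustrated without PhantomJS. The sketch below approximates the evaluate boundary with a JSON round trip; crossBoundary and makeFakeImage are hypothetical helpers, and the fake node (with src exposed through a non-enumerable getter) is a simplifying assumption rather than a model of how PhantomJS actually handles DOM nodes.

```javascript
// Hypothetical helper: approximates the page.evaluate boundary with a JSON
// round trip. PhantomJS's real marshalling is only approximated here.
function crossBoundary(value) {
    return JSON.parse(JSON.stringify(value));
}

// Stand-in for a DOM node (an assumption for illustration): src is exposed
// through a non-enumerable getter, so JSON serialization does not see it.
function makeFakeImage(url) {
    var node = {};
    Object.defineProperty(node, 'src', {
        enumerable: false,
        get: function () { return url; }
    });
    return node;
}

var images = [
    makeFakeImage('http://example.com/a.png'),
    makeFakeImage('http://example.com/b.png')
];

// Returning the nodes themselves: the src values are lost on the way out.
var nodesAfter = crossBoundary(images);

// Returning plain strings: the values survive the round trip.
var urlsAfter = crossBoundary(images.map(function (img) {
    return img.src;
}));

console.log(nodesAfter[0].src);
console.log(urlsAfter);
```

The point is the same as in the script above: extract the strings inside the evaluated function and return only those.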
Upvotes: 3
Reputation: 4334
document.images is not an Array of the nodes; it's an HTMLCollection, which is built off of an Object. You can see this if you for..in it:
for (a in document.images) {
console.log(a)
}
Prints:
0
1
2
3
length
item
namedItem
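This also accounts for the error in the question's update: for..in hands back the keys as strings, and a string has neither a src property nor a getAttribute method. A minimal sketch in plain JavaScript, where fakeImages is a hand-built, hypothetical stand-in for document.images rather than a real HTMLCollection:

```javascript
// Hypothetical stand-in for document.images: numeric keys plus length.
var fakeImages = {
    0: { src: 'http://example.com/a.png' },
    1: { src: 'http://example.com/b.png' },
    length: 2
};

var keyTypes = [];
var viaKey = [];
var viaLookup = [];

for (var key in fakeImages) {
    if (key === 'length') { continue; }   // skip the non-index key
    keyTypes.push(typeof key);            // the loop variable is a key, not a node
    viaKey.push(key.src);                 // a string has no src property
    viaLookup.push(fakeImages[key].src);  // look the node up by its key instead
}

console.log(keyTypes);
console.log(viaKey);
console.log(viaLookup);
```

Looking the element up by key works, but the extra keys (length, item, namedItem) still have to be filtered out, which is why an index-based loop tends to be simpler.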
Now, there are several ways to solve this:
1. Spread syntax, which copies the collection into an Array:
[...document.images]
2. A regular for loop, as with an array. This takes advantage of the fact that the keys are numbered like array indices:
for(var i = 0; i < document.images.length; i++) {
    document.images[i].src
}
3. And probably more as well.
Using solution 1 allows you to use Array functions on it, like map or reduce, but it has less support (I don't know whether the JavaScript engine in the current version of PhantomJS supports it).
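If spread syntax turns out to be unavailable, one ES5 alternative is to borrow Array.prototype.slice to copy an array-like object into a real Array, after which map and reduce work. The sketch below uses a hand-built, hypothetical stand-in (fakeImages) so it runs anywhere; the same call is expected to work on document.images inside page.evaluate, but that is not verified here against any particular PhantomJS version.

```javascript
// Hypothetical stand-in for document.images: numeric keys plus length.
var fakeImages = {
    0: { src: 'http://example.com/a.png' },
    1: { src: 'http://example.com/b.png' },
    length: 2
};

// slice only needs numeric keys and a length, so it copies array-likes too.
var asArray = Array.prototype.slice.call(fakeImages);

// With a real Array, map is available for extracting the URLs.
var urls = asArray.map(function (img) {
    return img.src;
});

console.log(Array.isArray(asArray));
console.log(urls);
```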
Upvotes: 0
Reputation: 3745
I am not sure about a direct JavaScript method, but I recently used jQuery to scrape images and other data, so after injecting jQuery you can write a script in the style below:
// Collect every match; assigning to data['src'] would keep only the last one.
var data = [];
$('.someclassORselector').each(function(){
    data.push($(this).attr('src'));
});
Upvotes: 0