gseccles

Reputation: 47

Scrape image src URLs using PhantomJS

I'm trying to get a list of all image src URLs in a given webpage using PhantomJS. My understanding is that this should be extremely easy, but for whatever reason, I can't seem to make it work. Here is the code I currently have:

var page = require('webpage').create();
page.open('http://www.walmart.com');

page.onLoadFinished = function(){
    var images = page.evaluate(function(){
        return document.getElementsByTagName("img");
    });
    for(thing in a){
        console.log(thing.src);
    }
    phantom.exit();
}

I've also tried this:

var a = page.evaluate(function(){
    returnStuff = new Array;
    for(stuff in document.images){
        returnStuff.push(stuff);
    }
    return returnStuff;
});

And this:

var page = require('webpage').create();
page.open('http://www.walmart.com', function(status){
    var images = page.evaluate(function() {
        return document.images;
    });
    for(image in images){
        console.log(image.src);
    }
    phantom.exit();
});

I've also tried iterating through the images in the evaluate function and getting the .src property that way.
None of them return anything meaningful. If I return the length of document.images, there are 54 images on the page, but trying to iterate through them provides nothing useful.

Also, I've looked at the following other questions and wasn't able to use the information they provided: How to scrape javascript injected image src and alt with phantom.js and How to download images from a site with phantomjs

Again, I just want the source url. I don't need the actual file itself. Thanks for any help.

UPDATE
I tried using

var a = page.evaluate(function(){
    returnStuff = new Array;
    for(stuff in document.images){
        returnStuff.push(stuff.getAttribute('src'));
    }
    return returnStuff;
});

It threw an error saying that stuff.getAttribute('src') returns undefined. Any idea why that would be?

Upvotes: 2

Views: 3559

Answers (4)

boney dsilva

Reputation: 21

I used the following code to get all the images loaded on the page. The images loaded in the browser changed dimensions based on the viewport. Since I wanted the maximum dimensions, I used the largest viewport to get the actual image sizes.

This gets the URL of every image on the page using PhantomJS. Even if an image is not in an img tag, the code below retrieves its URL.

Even images referenced from stylesheet rules like these are retrieved:

            @media screen and (max-width:642px) {
                .masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
                    background-image: url(assets/images/bg_studentcc-750x879-sm.jpg);
                }
            }
            @media screen and (min-width:643px) {
                .masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
                    background-image: url(assets/images/bg_studentcc-1920x490.jpg);
                }
            }

        var page =  require('webpage').create();
        var url = "https://......";

        page.settings.clearMemoryCaches = true;
        page.clearMemoryCache();
        page.viewportSize = {width: 1280, height: 1024};

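        page.open(url, function (status) {
            if (status === 'success') {
                console.log('The entire page is loaded.............################');
            }
            // Exit once the load callback fires; otherwise PhantomJS keeps running.
            phantom.exit();
        });

        page.onResourceReceived = function (response) {
            // Each resource is reported at the "start" and "end" stages; log it once.
            if (response.stage == "start") {
                // contentType may be missing for some responses, so guard before indexOf.
                var respType = response.contentType || '';

                if (respType.indexOf("image") == 0) {
                    console.log('Content-Type : ' + response.contentType);
                    console.log('Status : ' + response.status);
                    console.log('Image Size in byte : ' + response.bodySize);
                    console.log('Image Url : ' + response.url);
                    console.log('\n');
                }
            }
        };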
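
The content-type check in the handler above can be pulled out into a small function and tried without PhantomJS. This is only a sketch: `isImageResponse` is a hypothetical helper, and the sample objects are made up to resemble what onResourceReceived passes in.

```javascript
// Hypothetical helper: true when a response object looks like an image.
// contentType may be missing for some responses, hence the type guard.
function isImageResponse(response) {
    var type = response && response.contentType;
    return typeof type === 'string' && type.indexOf('image') === 0;
}

// Made-up response objects shaped like those passed to onResourceReceived.
var sampleResponses = [
    { url: 'http://example.com/logo.png', contentType: 'image/png', stage: 'start' },
    { url: 'http://example.com/app.js', contentType: 'application/javascript', stage: 'start' },
    { url: 'http://example.com/redirect', contentType: null, stage: 'start' }
];

// Keep only the image responses and pull out their URLs.
var imageUrls = sampleResponses.filter(isImageResponse).map(function (r) {
    return r.url;
});

console.log(imageUrls);
```

Inside PhantomJS the same function could be called from page.onResourceReceived in place of the inline indexOf test.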

Upvotes: 0

Vaviloff

Reputation: 16838

@MayorMonty was almost there. Indeed, you cannot return an HTMLCollection from page.evaluate.

As the docs say:

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Thus the working script is like this:

var page = require('webpage').create();

page.onLoadFinished = function(){

    // Runs inside the page; only JSON-serializable values can be returned.
    var urls = page.evaluate(function(){
        var image_urls = [];
        var images = document.getElementsByTagName("img");
        // Declare q with var so it does not leak into the page's global scope.
        for(var q = 0; q < images.length; q++){
            image_urls.push(images[q].src);
        }
        return image_urls;
    });

    console.log(urls.length);
    console.log(urls[0]);

    phantom.exit();
};

page.open('http://www.walmart.com');
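
The JSON rule of thumb quoted above can be tried outside PhantomJS. Here is a minimal sketch that runs in Node or a browser console. It makes some assumptions: `fakeImages` is a made-up stand-in for an HTMLCollection, `collectSrcs` and `roundTrip` are hypothetical helpers, and the JSON round trip only approximates what happens to a value returned from evaluate.

```javascript
// Hypothetical stand-in for document.images: an array-like object whose
// entries carry a src plus something JSON cannot represent (a function).
var fakeImages = {
    0: { src: 'http://example.com/a.png', click: function () {} },
    1: { src: 'http://example.com/b.png', click: function () {} },
    length: 2
};

// Copy only the plain string values out of the array-like.
function collectSrcs(images) {
    var urls = [];
    for (var i = 0; i < images.length; i++) {
        urls.push(images[i].src);
    }
    return urls;
}

// Approximates what crossing the evaluate() boundary does to a value.
function roundTrip(value) {
    return JSON.parse(JSON.stringify(value));
}

var survivingUrls = roundTrip(collectSrcs(fakeImages));
var survivingNode = roundTrip(fakeImages[0]);

console.log(survivingUrls);  // the array of URL strings survives intact
console.log(survivingNode);  // the function-valued property is dropped
```

An array of strings comes back unchanged, while anything that is not plain data is lost, which is why the script builds the list of src strings inside evaluate before returning it.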

Upvotes: 3

bren

Reputation: 4334

document.images is not an Array of nodes; it's an HTMLCollection, an array-like object. You can see this if you for..in it:

for (a in document.images) {
  console.log(a)
}

Prints:

0
1
2
3
length
item
namedItem

Now, there are several ways to solve this:

  1. ES6 spread operator: this turns iterables (which HTMLCollection is in modern browsers) into arrays. Use it like so: [...document.images]
  2. Regular for loop, like an array. This takes advantage of the fact that the keys are labeled like an array:

    for(var i = 0; i < document.images.length; i++) {
      console.log(document.images[i].src);
    }
    

There are probably more ways as well.

Using solution 1 allows you to use Array functions on it, like map or reduce, but it has less support (I don't know whether the JavaScript engine in the current version of PhantomJS supports it).
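
To see the difference without a DOM, here is a small sketch that runs in Node. The `images` object is a made-up stand-in for document.images. The last part shows an ES5 alternative to spread (borrowing Array.prototype.map), which is offered as a suggestion and is not one of the two solutions listed above.

```javascript
// Hypothetical stand-in for document.images (no DOM is needed to run this).
var images = {
    0: { src: 'a.png' },
    1: { src: 'b.png' },
    length: 2
};

// for..in visits property names (strings), not the elements themselves.
var keys = [];
for (var key in images) {
    keys.push(key);
}

// Strings have no src property, so reading .src from a key gives undefined.
var wrong = keys.map(function (k) { return k.src; });

// ES5 alternative to spread: borrow Array.prototype.map for the array-like.
var srcs = Array.prototype.map.call(images, function (img) {
    return img.src;
});

console.log(keys);
console.log(wrong);
console.log(srcs);
```

The first loop is the pattern from the question: it iterates over keys, so `.src` is never read from an actual element.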

Upvotes: 0

abhirathore2006

Reputation: 3745

I'm not sure about a plain JavaScript method, but I recently used jQuery to scrape images and other data. After injecting jQuery, you can write the script in the style below:

var data = { src: [] };
$('.someclassORselector').each(function(){
    // Collect every match instead of overwriting a single value.
    data.src.push($(this).attr('src'));
});

Upvotes: 0
