Clg9100

Reputation: 31

Scraping "lazy loading" images with Selenium

I'm working on a small web scraping project with Selenium in which I scrape product information from a clothing site (https://www.asos.com/us/search/?q=shirt). After quite some trial and error I've been able to get most of the product information (name, price, product page URL, etc.), but I'm having trouble scraping the image src values from the page source, even though the approach should be much the same. The following is the code snippet where I try to scrape the images from the page:

    from selenium.webdriver.common.by import By

    imgsSrc = set()
    containers = driver.find_elements(By.CLASS_NAME, "productMediaContainer_kmkXR")
    for container in containers:
        image = container.find_element(By.TAG_NAME, 'img')
        print(image.get_attribute('src'))
        imgsSrc.add(image.get_attribute('src'))

This works for roughly the first 8 products, but then it fails. From what I've found about similar situations, this could be because the site uses "lazy loading" for the img tags. In the page source, products after about the 8th entry have a different, lazy-loading img class, and I think that's why it fails to grab the rest of the images.

I'm unsure if it matters, but before scraping anything my program also clicks the page's load-more button until (if possible) ~216 products are displayed, and it applies a user-inputted filter to the products.
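For context, the load-more clicking is along these lines (the button selector here is just a placeholder, not the site's real one, and the helper name is mine):

    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Placeholder locator -- the real "load more" button selector is different
    LOAD_MORE_BUTTON = (By.CSS_SELECTOR, "a.loadButton")

    def click_load_more_until(driver, target_count=216, timeout=10):
        """Keep clicking the load-more button until roughly target_count products are shown."""
        while len(driver.find_elements(By.CLASS_NAME, "productMediaContainer_kmkXR")) < target_count:
            try:
                button = WebDriverWait(driver, timeout).until(
                    EC.element_to_be_clickable(LOAD_MORE_BUTTON)
                )
                button.click()
            except TimeoutException:
                break  # No load-more button left to click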

One thing I've tried is having the driver scroll to the end of the page before scraping the images, but I'm not sure whether the images only load into the page source once they're in the viewport.
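For reference, that attempt was just a single jump to the bottom, something like:

    import time

    # Single jump to the bottom of the page (my first attempt, which wasn't enough)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # arbitrary wait to give the images a chance to load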

From my understanding, the div class ("productMediaContainer_kmkXR") I'm pulling from for each product isn't lazy loaded, but the img tag contained within it could be. (It's also possible that, instead of an img tag, the product has a video tag that still carries an image in its "poster" attribute.)

Currently, I'm just trying to figure out how to get ALL of the images for the products on the page. I'm unsure whether the problem is that I'm not gradually scrolling the page while scraping, or something else.

Upvotes: 1

Views: 894

Answers (2)

Clg9100

Reputation: 31

After quite a bit of testing and frankensteining various solutions to similar problems, the following is the solution I landed on for my purposes:

Firstly, my issue was seemingly that I needed to scroll the full page with the webdriver slowly enough for each image to load, as I BELIEVE they were indeed lazy loaded. The following code snippet is what I used to do so; it's a bit slow but can likely be tweaked to be faster and still work:

    import time

    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    imgsSrc = []
    driver.execute_script("window.scrollTo(0, 0);")  # Go to top of page
    SCROLL_PAUSE_TIME = 2  # How long to wait between scrolls
    while True:
        previous_scrollY = driver.execute_script('return window.scrollY')
        # driver.execute_script('window.scrollBy(0, 400)')  # Alternative scroll, a bit slower but reliable
        html = driver.find_element(By.TAG_NAME, 'html')
        html.send_keys(Keys.PAGE_DOWN)
        html.send_keys(Keys.PAGE_DOWN)
        html.send_keys(Keys.PAGE_DOWN)  # Faster scroll, inelegant but works (could translate to a pixel scroll like above)
        time.sleep(SCROLL_PAUSE_TIME)  # Give the images a bit of time to load by waiting

        # Compare the new scroll position with the last one; if it hasn't changed, we've hit the bottom
        if previous_scrollY == driver.execute_script('return window.scrollY'):
            break

This should scroll to the bottom of the page slowly and, in doing so, allow the images in the containers to load for scraping. As I initially mentioned, the particular page I was using for testing COULD have a video tag rather than an image, but an image can still be pulled from its 'poster' attribute. The following is how I scraped the images and handled the video case:

    from selenium.common.exceptions import NoSuchElementException

    missingCount = 0  # How many images did we miss (testing purposes)
    containers = driver.find_elements(By.CLASS_NAME, "productMediaContainer_kmkXR")
    print(len(containers))  # Make sure we're getting all the containers
    for container in containers:
        try:
            image = container.find_element(By.TAG_NAME, 'img')
            print(image.get_attribute('src'))
            imgsSrc.append(image.get_attribute('src'))
        except NoSuchElementException:  # Ideally it's a video rather than an image (otherwise we didn't give it time to load)
            print("Whoops - Check if video")
            try:
                video = container.find_element(By.TAG_NAME, 'video')
                print(video.get_attribute('poster'))
                imgsSrc.append(video.get_attribute('poster'))
            except NoSuchElementException:  # It wasn't a video - OR we didn't give it enough time to load
                missingCount += 1
                print("We're really broken")

    print(missingCount)
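As a possible speed-up (just a sketch I haven't tested beyond my own page), you could instead scroll each container into view individually and wait for its src to be populated, rather than paging through the whole document:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def get_image_src(driver, container, timeout=5):
        """Scroll one product container into view and wait for its img src to be filled in."""
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", container)
        image = container.find_element(By.TAG_NAME, 'img')
        # Assumes the lazy loader leaves src empty or as a data: URI until the image loads
        WebDriverWait(driver, timeout).until(
            lambda _: image.get_attribute('src') and not image.get_attribute('src').startswith('data:')
        )
        return image.get_attribute('src')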

Thank you to everyone for their answers, and good luck to future readers who stumble upon this. I hope it's helpful; in my case it took a good deal of troubleshooting and piecing together similar issues others were having.

Upvotes: 2

LetsScrapeData

Reputation: 106

You can get the product information, including the image URLs, from the script portion of the page source, then download the images directly from those URLs.

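For example, something along these lines would pull the image URLs out of an embedded JSON blob and download them (a rough sketch only: the script variable name and the "products"/"imageUrl" fields are assumptions that need to be checked against the actual page source):

    import json
    import re

    import requests

    # Hypothetical sketch -- the variable name and JSON field names below are
    # assumptions; inspect the real page source to find the actual script data.
    html = driver.page_source
    match = re.search(r'window\.__PRODUCT_DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        for i, product in enumerate(data.get("products", [])):
            img_url = product.get("imageUrl")
            if img_url:
                resp = requests.get(img_url, timeout=10)
                with open(f"product_{i}.jpg", "wb") as f:
                    f.write(resp.content)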

Upvotes: 1
