Sushil

Reputation: 5531

Selenium Python Instagram Scraping All Images in a post not working

I am writing a small script to download all the images and videos in an Instagram post. Here is my code:

import urllib.request as reqq
from selenium import webdriver
import time

browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")

browser.maximize_window()

url_list = ['https://www.instagram.com/p/CE9CZmsghan/']

img_urls = []
vid_urls = []
img_url = ""
vid_url = ""
    
for x in url_list:
    count = 0

    browser.get(x)

    while True:
        try:
            # click the 'next' arrow to advance the carousel
            elements = browser.find_elements_by_class_name('_6CZji')
            elements[0].click()
            time.sleep(1)
        except:
            count += 1
            time.sleep(1)
            if count == 2:
                break
        try:
            # if this slide is a video, grab its source...
            vid_url = browser.find_element_by_class_name('_5wCQW').find_element_by_tag_name('video').get_attribute('src')
            vid_urls.append(vid_url)
        except:
            # ...otherwise grab the image source
            img_url = browser.find_element_by_class_name('KL4Bh').find_element_by_tag_name('img').get_attribute('src')
            img_urls.append(img_url)

for x in range(len(img_urls)):
    reqq.urlretrieve(img_urls[x], f"D:\\instaimg{x+1}.jpg")

for x in range(len(vid_urls)):
    reqq.urlretrieve(vid_urls[x], f"D:\\instavid{x+1}.mp4")

browser.close()

This code extracts every image in the post except the last one. As far as I can tell, the code is correct. Does anyone know why it misses the last image? Any help would be appreciated. Thanks!

Upvotes: 1

Views: 1313

Answers (1)

Z4-tier

Reputation: 7978

Go to the URL you're using in the example, open the inspector, and watch carefully how the DOM changes as you click between images. There are multiple page elements with the class KL4Bh because Instagram keeps the previous image, the current image, and the next image in the DOM at the same time.

So doing find_element_by_class_name('KL4Bh') returns the first match on the page.
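
To make that concrete, here is a small sketch (hypothetical, assuming the three-slot carousel layout described above, run with the post open in the browser):

    # on a middle slide there are typically three matches for this class
    frames = browser.find_elements_by_class_name('KL4Bh')
    # frames[0] -> previous image, frames[1] -> current image, frames[2] -> next image,
    # so find_element_by_class_name('KL4Bh') always resolves to frames[0]
    print(len(frames), frames[0].find_element_by_tag_name('img').get_attribute('src'))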

OK, let's break down this loop and see what is happening:

first iteration
    page opens
    immediately click 'next' to second photo
    grab the first element for class 'KL4Bh' from the DOM
    the first element for that class is the first image (now the 'previous' image)
[... 2, 3, 4 same as 1 ...]
fifth iteration
    look for a "next" button to click
    find no next button
    `elements[0]` fails with index error
    grab the first element for class 'KL4Bh' from the DOM
    the first element for that class is **still the fourth image**
sixth iteration
    look for a "next" button to click
    find no next button
    `elements[0]` fails with index error
    error count exceeds threshold
    exit loop

try something like this:

    n = 0
    while True:
        try:
            elements = browser.find_elements_by_class_name('_6CZji')
            elements[0].click()
            time.sleep(1)
        except IndexError:
            # no 'next' button: we are on the last slide, so look at index 1
            # (the current image) instead of index 0 (the previous image)
            n = 1
            count += 1
            time.sleep(1)
            if count == 2:
                break
        try:
            vid_url = browser.find_elements_by_class_name('_5wCQW')[n].find_element_by_tag_name('video').get_attribute('src')
            vid_urls.append(vid_url)
        except:
            img_url = browser.find_elements_by_class_name('KL4Bh')[n].find_element_by_tag_name('img').get_attribute('src')
            img_urls.append(img_url)

It will do the same thing as before, except that it now uses find_elements_by_class_name and indexes into the resulting list. When the loop reaches the last image, the index error from the failed button click also bumps the index used for the lookup, so on the final iteration it takes the second element (the current image) instead of the first. There are still some serious problems with this code, but it does fix the bug you are seeing. One problem at a time :)

Edit

A few things that I think would improve this code:

  1. When using try-except blocks to catch exceptions/errors, there are a few rules that should almost always be followed:
  • Name the specific exceptions and errors you want to handle; don't use a bare except. By catching every possible error, we actually suppress and obfuscate the source of bugs. The only legitimate reason to catch everything is to generate a custom error message, and in that case the last line of the except-block should be raise so the error can still propagate (see the first sketch after this list). It goes against how we typically think about software errors, but when writing code, errors are your friend.
  • The try-except blocks are also problematic because they are being used as a conditional control structure. Sometimes it seems easier to code like this, but it is usually a sign of an incomplete understanding of the libraries being used. I am specifically referring to the block that checks for a video versus an image, although the other one could be refactored too. As a rule, when doing conditional branching, use an if statement (second sketch below).
  2. Using sleep with selenium is almost always incorrect, but it's by far the most common pitfall for new selenium users. What happens is that the developer starts getting errors about missing elements when searching the DOM, and correctly concludes that the page was not fully loaded in the browser before selenium tried to read it. But sleep is not the right fix, because waiting a fixed time makes no guarantee that the page will be fully loaded. Selenium has a built-in mechanism for this, called explicit wait (along with implicit wait and fluent wait). An explicit wait guarantees that the page element is present or visible before your code is allowed to proceed (third sketch below).
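
For the first point, a minimal sketch (reusing the class names from your code; the pass/print handling is just illustrative):

    try:
        elements = browser.find_elements_by_class_name('_6CZji')
        elements[0].click()
    except IndexError:
        # expected case: no 'next' button on the last slide
        pass
    except Exception as exc:
        # unexpected: add context, then re-raise so the real bug isn't swallowed
        print(f"unexpected failure while paging the carousel: {exc!r}")
        raise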
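
For the second point, the video-versus-image check can be an if statement, because find_elements_by_class_name returns an empty list when nothing matches (a sketch, assuming the _5wCQW wrapper only appears on video slides, as your original branching does):

    videos = browser.find_elements_by_class_name('_5wCQW')
    if videos:
        # this slide is a video
        vid_urls.append(videos[n].find_element_by_tag_name('video').get_attribute('src'))
    else:
        # this slide is an image
        images = browser.find_elements_by_class_name('KL4Bh')
        img_urls.append(images[n].find_element_by_tag_name('img').get_attribute('src'))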
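
And for the explicit wait, something along these lines (a sketch; note that on the last slide the wait raises TimeoutException rather than IndexError, so the surrounding error handling would change accordingly):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(browser, 10)  # give up after 10 seconds

    # instead of time.sleep(1): block only until the button is actually clickable
    next_button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, '_6CZji')))
    next_button.click()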

Upvotes: 1
