Reputation: 5531
I am writing a small script to download all images/videos in a post. Here is my code:
import urllib.request as reqq
from selenium import webdriver
import time
browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")
browser.get(url)
browser.maximize_window()
url_list = ['https://www.instagram.com/p/CE9CZmsghan/']
img_urls = []
vid_urls = []
img_url = ""
vid_url = ""
for x in url_list:
    count = 0
    browser.get(x)
    while True:
        try:
            elements = browser.find_elements_by_class_name('_6CZji')
            elements[0].click()
            time.sleep(1)
        except:
            count+=1
            time.sleep(1)
            if count == 2:
                break
        try:
            vid_url = browser.find_element_by_class_name('_5wCQW').find_element_by_tag_name('video').get_attribute('src')
            vid_urls.append(vid_url)
        except:
            img_url = browser.find_element_by_class_name('KL4Bh').find_element_by_tag_name('img').get_attribute('src')
            img_urls.append(img_url)
for x in range(len(img_urls)):
    reqq.urlretrieve(img_urls[x],f"D:\\instaimg"+str(x+1)+".jpg")
for x in range(len(vid_urls)):
    reqq.urlretrieve(vid_urls[x],"D:\\instavid"+str(x+1)+".mp4")
browser.close()
This code extracts all the images in the post except the last image. IMO, this code is right. Do you know why this code doesn't extract the last image? Any help would be appreciated. Thanks!
Reputation: 7978
Go to the URL that you're using in the example, open the inspector, and very carefully watch how the DOM changes as you click between images. There are multiple page elements with class `KL4Bh` because the page tracks the previous image, the current image, and the next image. So `find_element_by_class_name('KL4Bh')` returns only the first match on the page, which is not necessarily the image currently on screen.
OK, let's break down this loop and see what is happening:
- first iteration
  - page opens
  - immediately click 'next' to second photo
  - grab the first element for class 'KL4Bh' from the DOM
  - the first element for that class is the first image (now the 'previous' image)
- [... 2, 3, 4 same as 1 ...]
- fifth iteration
  - look for a "next" button to click
  - find no next button
  - `elements[0]` fails with index error
  - grab the first element for class 'KL4Bh' from the DOM
  - the first element for that class is **still the fourth image**
- sixth iteration
  - look for a "next" button to click
  - find no next button
  - `elements[0]` fails with index error
  - error count exceeds threshold
  - exit loop
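The walkthrough above can be reproduced without a browser. The sketch below is a pure-Python simulation, not Selenium code. `dom_window` is a hypothetical helper that models the carousel as keeping only the previous, current, and next slide in the DOM (an assumption taken from the description above), and `collect_first_match` mimics the original loop for a five-image post.

```python
def dom_window(current, total):
    """Slides assumed to be in the DOM: previous, current, next (1-based)."""
    return [i for i in (current - 1, current, current + 1) if 1 <= i <= total]


def collect_first_match(total):
    """Mimic the original loop: try to click 'next', then take the first match."""
    current, failures, grabbed = 1, 0, []
    while True:
        if current < total:
            # A 'next' button exists, so the click succeeds.
            current += 1
        else:
            # On the last slide elements[0] raises, which bumps the counter.
            failures += 1
            if failures == 2:
                break
        # find_element_by_class_name returns the FIRST element with the class.
        grabbed.append(dom_window(current, total)[0])
    return grabbed


print(collect_first_match(5))  # [1, 2, 3, 4, 4]
```

Under this model the fourth image is collected twice and the fifth never appears, which matches the symptom in the question.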
Try something like this:
n = 0
while True:
    try:
        elements = browser.find_elements_by_class_name('_6CZji')
        elements[0].click()
        time.sleep(1)
    except IndexError:
        n=1
        count+=1
        time.sleep(1)
        if count == 2:
            break
    try:
        vid_url = browser.find_elements_by_class_name('_5wCQW')[n].find_element_by_tag_name('video').get_attribute('src')
        vid_urls.append(vid_url)
    except:
        img_url = browser.find_elements_by_class_name('KL4Bh')[n].find_element_by_tag_name('img').get_attribute('src')
        img_urls.append(img_url)
It does the same thing as before, except that it now uses `find_elements_by_class_name` and indexes into the resulting list. When it reaches the last image, the IndexError from the failed button click also switches the index used for the lookup from 0 to 1, so the last iteration of the loop takes the second element (the current image) instead of the first.
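To see why switching the index after the failed click picks up the last image, here is the same idea as a browser-free simulation. It is only a sketch, assuming the DOM holds the previous, current, and next slide; `dom_window` and `collect_with_offset` are hypothetical names and not part of Selenium.

```python
def dom_window(current, total):
    """Slides assumed to be in the DOM: previous, current, next (1-based)."""
    return [i for i in (current - 1, current, current + 1) if 1 <= i <= total]


def collect_with_offset(total):
    """Mimic the adjusted loop: index n becomes 1 once the click fails."""
    current, failures, n, grabbed = 1, 0, 0, []
    while True:
        if current < total:
            # The 'next' click succeeds and the carousel advances.
            current += 1
        else:
            # The failed click (IndexError) switches the lookup to index 1.
            n = 1
            failures += 1
            if failures == 2:
                break
        grabbed.append(dom_window(current, total)[n])
    return grabbed


print(collect_with_offset(5))  # [1, 2, 3, 4, 5]
```

Note that under this model a single-image post would leave only one matching element, so index 1 would itself raise an IndexError. That is one way the approach remains fragile.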
There are still some serious problems with this code, but it does fix the bug you are seeing. One problem at a time :)
A few things that I think would improve this code:
- When using `try-except` blocks to catch exceptions/errors, there are a few rules that should almost always be followed:
  - Do not use a bare `except`. The reason for this is that by catching every possible error, we actually suppress and obfuscate the source of bugs. The only legitimate reason to do this is to generate a custom error message, and the last line of the `except`-block should always be `raise` to allow the error to propagate. It goes against how we typically think of software errors, but when writing code, errors are your friend.
  - The `try-except` blocks are also problematic because they are being used as a conditional control structure. Sometimes it seems easier to code like this, but it is usually a sign of incomplete understanding of the libraries being used. I am specifically referring to the block that is checking for a video versus an image, although the other one could be refactored too. As a rule, when doing conditional branching, use an `if` statement.
- Using `sleep` with selenium is almost always incorrect, but it's by far the most common pitfall for new selenium users. What happens is that the developer will start getting errors about missing elements when trying to search the DOM. They will correctly conclude that it is because the page was not fully loaded in the browser before selenium tried to read it. But using `sleep` is not the right approach because just waiting for a fixed time makes no guarantee that the page will be fully loaded. Selenium has a built-in mechanism to handle this, called explicit wait (along with implicit wait and fluent wait). An explicit wait blocks until the page element you name is visible, or raises an error once its timeout expires, before your code is allowed to proceed.
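As a rough illustration of the last two points, the sketch below branches with `if` instead of `try-except` and waits explicitly instead of sleeping. It assumes the Selenium 3-style `find_elements_by_*` methods used in the question (newer Selenium releases replace them with `find_elements(By.TAG_NAME, ...)`), and it reuses the class names from the question, which Instagram may change at any time. The helper names `current_media_src` and `wait_for_visible` are made up for this example, and `container` stands for whichever element wraps the slide you want to read.

```python
def current_media_src(container):
    """Return ('video' or 'image', src) for the media inside `container`.

    The plural find_elements_* methods return an empty list when nothing
    matches, so an ordinary `if` can decide between video and image.
    """
    videos = container.find_elements_by_tag_name('video')
    if videos:
        return 'video', videos[0].get_attribute('src')
    images = container.find_elements_by_tag_name('img')
    if images:
        return 'image', images[0].get_attribute('src')
    # Fail loudly with a specific error instead of hiding the problem.
    raise LookupError('container holds neither a <video> nor an <img>')


def wait_for_visible(browser, class_name, timeout=10):
    """Block until an element with `class_name` is visible, then return it."""
    # Imported here so this sketch can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException if the
    # element is still not visible after `timeout` seconds.
    return WebDriverWait(browser, timeout).until(
        EC.visibility_of_element_located((By.CLASS_NAME, class_name))
    )
```

For example, calling `wait_for_visible(browser, 'KL4Bh')` right after `browser.get(x)` waits for the first image container rather than sleeping for a fixed second. Waiting for the carousel to advance after a click needs a condition tied to the new slide, which depends on markup not shown here.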