P_n

Reputation: 992

Selenium scrolling and scraping with BeautifulSoup produces duplicate results

I have this script to download images from Instagram. The only issue I am having is that as Selenium scrolls down toward the bottom of the webpage, BeautifulSoup keeps grabbing the same img src links on each pass of the loop, so requests downloads them again.

Although it continues to scroll down and download pictures, once all of that is done I end up with 2 or 3 duplicates. So my question is: is there a way to prevent this duplication from happening?

import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import time


url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)

scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    # img_url is the module-level variable set inside the loop below
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    if new_height == last_height:
        break
    last_height = new_height


Update: I placed this part of the code outside of the while True loop and let Selenium load the whole page first, hoping that bs4 would then scrape all the images. It only gets to number 30 and then stops.

soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    #tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1

Upvotes: 1

Views: 1294

Answers (2)

Mihai Chelaru

Reputation: 8262

The reason the second version of your script only loads 30 images is that the rest of the elements are removed from the page DOM as you scroll, so they are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you were doing the first time, but to remove any duplicate elements before you iterate through the list and call screens(). You can do this using sets as below, though I'm not sure if this is the absolute most efficient way to do it:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()

url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()

while True:

    imgs = driver.find_elements_by_class_name('_2di5p')

    imgs_dedupe = set(imgs) - set(old_imgs)  # keep only elements not seen on the previous pass

    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    old_imgs = imgs

    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
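If you'd rather not depend on WebElement identity at all (the element objects can be recreated as the DOM changes), a variant that deduplicates on the src URLs themselves should behave the same way. This is only a sketch of the loop, reusing the same driver, screens(), counter, scroll_delay, and last_height from above:

seen_urls = set()

while True:

    for img in driver.find_elements_by_class_name('_2di5p'):
        img_url = img.get_attribute("src")
        if img_url not in seen_urls:  # skip anything already downloaded
            seen_urls.add(img_url)
            print('=> [+] img_{}'.format(counter))
            screens(counter)
            counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height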

Upvotes: 2

Biarys

Reputation: 1183

I would use the os library to check whether the file already exists:

import os


def screens(get_name):
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    if os.path.isfile(path):      # checks the file exists; gives False on a directory
        # or use os.path.exists(path), which checks for a file or directory
        pass
    else:
        r = requests.get(img_url)
        with open(path, 'wb') as f:
            f.write(r.content)

*Note that the existence check has to come before open(), since opening the file in 'wb' mode creates (or truncates) it, which would make the check always succeed.
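If you also want to stop relying on the global img_url, a small variation (just a sketch; the path is the one from the question) passes the URL in explicitly:

import os
import requests


def screens(get_name, img_url):
    # build the path once so the existence check and the write agree
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    if not os.path.isfile(path):  # skip files we already have
        r = requests.get(img_url)
        with open(path, 'wb') as f:
            f.write(r.content)

The loop would then call screens(counter, img_url) instead of screens(counter).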

Upvotes: -1
