P_n

Reputation: 992

Selenium scrolling and scraping with BeautifulSoup produces duplicate results

I have this script to download images from Instagram. The only issue I am having is that as Selenium scrolls down toward the bottom of the webpage, BeautifulSoup keeps grabbing the same img src links on each pass of the loop, so requests downloads them again.

Although it continues to scroll down and download pictures, once all of that is done I end up with 2 or 3 duplicates. So my question is: is there a way to prevent this duplication from happening?

import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import time


url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)

scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    # img_url is the module-level variable set inside the loop below
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    if new_height == last_height:
        break
    last_height = new_height


Update: I placed this part of the code outside of the while True loop and let Selenium load the whole page first, hoping that bs4 would then scrape all the images. It only gets to number 30 and then stops.

soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    #tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1

Upvotes: 1

Views: 1294

Answers (2)

Mihai Chelaru

Reputation: 8262

The reason the second version of your script only loads 30 images is that the rest of the elements are removed from the page DOM as you scroll, so they are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you were doing the first time, but to remove any duplicate elements before you iterate through the list and call screens(). You can do this using sets as below, though I'm not sure if this is the absolute most efficient way to do it:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()

url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()

while True:

    imgs = driver.find_elements_by_class_name('_2di5p')

    imgs_dedupe = set(imgs) - set(old_imgs)  # keep only elements not seen on the previous pass

    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    old_imgs = imgs

    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
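If you'd rather not depend on WebElement identity at all (the element objects can be recreated as the DOM changes), a variant that deduplicates on the src URLs themselves should behave the same way. This is only a sketch of the loop, reusing the same driver, screens(), counter, scroll_delay, and last_height from above:

seen_urls = set()

while True:

    for img in driver.find_elements_by_class_name('_2di5p'):
        img_url = img.get_attribute("src")
        if img_url not in seen_urls:  # skip anything already downloaded
            seen_urls.add(img_url)
            print('=> [+] img_{}'.format(counter))
            screens(counter)
            counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height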

Upvotes: 2

Biarys

Reputation: 1183

I would use the os library to check whether the file already exists:

import os


def screens(get_name):
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    if os.path.isfile(path):      # checks the file exists; gives False on a directory
        # or use os.path.exists(path), which checks for a file or directory
        pass
    else:
        r = requests.get(img_url)
        with open(path, 'wb') as f:
            f.write(r.content)

*Note that the existence check has to come before open(), since opening the file in 'wb' mode creates (or truncates) it, which would make the check always succeed.
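If you also want to stop relying on the global img_url, a small variation (just a sketch; the path is the one from the question) passes the URL in explicitly:

import os
import requests


def screens(get_name, img_url):
    # build the path once so the existence check and the write agree
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    if not os.path.isfile(path):  # skip files we already have
        r = requests.get(img_url)
        with open(path, 'wb') as f:
            f.write(r.content)

The loop would then call screens(counter, img_url) instead of screens(counter).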

Upvotes: -1
