SIM

Reputation: 22440

Can't reach the bottom of a webpage

I've written a script in Python with Selenium to handle an infinite-scrolling webpage. The problem I'm facing is that it scrolls a few times and then quits the browser; it never reaches the bottom. I tried an Explicit Wait as well, but that gives even fewer scrolls. How can I reach the bottom, so that the script stops only when there is no more scrolling to do?

This is my try:

import time
from selenium import webdriver
from urllib.parse import urljoin

url = "https://www.instagram.com/explore/tags/travelphotoawards/"

driver = webdriver.Chrome()
driver.get(url)

last_len = len(driver.find_elements_by_css_selector(".v1Nh3 a"))
new_len = last_len

while True:
    last_len = new_len
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(5)

    items = driver.find_elements_by_css_selector(".v1Nh3 a")
    new_len = len(items)
    if last_len == new_len:
        break

driver.quit()

Edit:

If I try it like below, I can scroll as many times as I want, but hardcoding the number of scrolls is not a good way to cope with this:

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://www.instagram.com/explore/tags/travelphotoawards/"

driver = webdriver.Chrome()
driver.get(url)

for scroll in range(1,10):  # I can scroll as many times as I want, but the count is fully hardcoded
    item = driver.find_element_by_tag_name("body")
    item.send_keys(Keys.END)
    elems = driver.find_elements_by_css_selector(".v1Nh3 a")
    time.sleep(3)

driver.quit()

I hope there is some way to keep scrolling automatically until the script reaches the bottom.

Upvotes: 0

Views: 410

Answers (2)

Tarun Lalwani

Reputation: 146520

So, a few things here. In the case of infinite scrolling I would do the following:

  • Disable images so that the scrolling is faster.
  • Never trust a condition to be true if it is not consistent. Test it continuously for a period, and only trust it once it stays consistent.
  • Try not to scroll for too long; infinite scrolling can cause the browser to clog up too much memory and sometimes even crash.
  • Dump data in batches after every scroll. So on the first page load I would dump all the page data, and then on every scroll I would dump just the delta. This can be done easily using an XPath (see the sketch after this list).
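The delta-dump idea in the last bullet might look like the sketch below. This is a minimal illustration, not part of the original answer; the v1Nh3 class and the positional XPath are assumptions about the page structure:

# Sketch: dump only the anchors added since the last scroll, using a
# positional XPath (position() is 1-based) to skip already-dumped items.
# Caveat: if the page removes old nodes while scrolling, positions shift
# and this simple counter would need adjusting.
def dump_new_hrefs(driver, already_dumped):
    new_anchors = driver.find_elements_by_xpath(
        "(//div[contains(@class, 'v1Nh3')]/a)[position() > %d]" % already_dumped
    )
    for anchor in new_anchors:
        print(anchor.get_attribute("href"))
    return already_dumped + len(new_anchors)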

Below is an updated script which should work better for you. Do remember that nothing is perfect, so you need to make your script adapt to failures.

import time
from selenium import webdriver
from urllib.parse import urljoin

option = webdriver.ChromeOptions()
# Disable image loading so that each scroll renders faster
chrome_prefs = {
    "profile.default_content_settings": {"images": 2},
    "profile.managed_default_content_settings": {"images": 2},
}
option.add_experimental_option("prefs", chrome_prefs)


driver = webdriver.Chrome(chrome_options=option)

url = "https://www.instagram.com/explore/tags/travelphotoawards/"

driver.get(url)

last_len = len(driver.find_elements_by_css_selector(".v1Nh3 a"))
new_len = last_len

consistent = 0  # consecutive scrolls that yielded no new items
while True:
    last_len = new_len
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(5)
    items = driver.find_elements_by_css_selector(".v1Nh3 a")
    new_len = len(items)
    if last_len == new_len:
        consistent += 1
        if consistent == 3:
            break
    else:
        consistent = 0

driver.quit()
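
Since the question mentions Explicit Wait: the fixed time.sleep(5) in the loop above could be swapped for a bounded WebDriverWait that returns as soon as the item count grows. A minimal sketch, assuming the same .v1Nh3 a selector; it is not part of the original answer:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

# Inside the while loop, instead of time.sleep(5): wait up to 10 seconds for
# new items to appear; a timeout just means nothing loaded this round, and
# the consistency counter takes care of it.
try:
    WebDriverWait(driver, 10).until(
        lambda d: len(d.find_elements_by_css_selector(".v1Nh3 a")) > last_len
    )
except TimeoutException:
    pass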

Upvotes: 3

Guy

Reputation: 50854

Every time there is a scroll, older images disappear, so you might get the same number of images, or even a smaller number, after the scroll.

Each image has a unique href, so you can compare the href of the last image to the href of the previous last image:

import time

# Assumes driver has been initialized and is on the page, as in the question
last_href = driver.find_elements_by_css_selector('.v1Nh3 > a')[-1].get_attribute('href')
new_href = last_href

while True:
    last_href = new_href
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    time.sleep(5)

    new_href = driver.find_elements_by_css_selector('.v1Nh3 > a')[-1].get_attribute('href')

    if last_href == new_href:  # the last image did not change: bottom reached
        break
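
A possible refinement, combining both answers (my sketch, not from either post): require the last href to stay unchanged for a few consecutive scrolls before breaking, so that one slow load is not mistaken for the bottom:

import time

# Assumes driver is already on the page; combines the href comparison above
# with the consistency counter from the other answer.
new_href = driver.find_elements_by_css_selector('.v1Nh3 > a')[-1].get_attribute('href')
consistent = 0
while consistent < 3:
    last_href = new_href
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    new_href = driver.find_elements_by_css_selector('.v1Nh3 > a')[-1].get_attribute('href')
    consistent = consistent + 1 if last_href == new_href else 0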

Upvotes: 2
