Mathieu Ouellet

Reputation: 11

Problems with Selenium, driver keeps crashing

This is my first time posting on Stack Overflow and my first time using Selenium, so please be kind! :) I am trying to scrape movie links from the IMDb search results page for a deep learning project. I need at the VERY least 50,000 links.

I wrote a Python program using BeautifulSoup that works very well. The only problem is how the results page is built. Instead of being split across multiple pages, it has a "50 more" button that dynamically adds 50 more movies to the DOM. So for my BeautifulSoup program to work, I need to load enough movies, save the resulting HTML, and pass it to BeautifulSoup. Since I don't want to click manually, I wrote a Selenium script that clicks the "50 more" button a set number of times and then saves the HTML.

The problem is that it always crashes after 13,200 movies. I suspect this is because the DOM grows too large for the Selenium-driven browser to handle. Is there anything I can do to stop my program from crashing, or is there a workaround? I really need more than 13,200 links and I'm a total newbie to Selenium!
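For context, here is a minimal sketch of the parsing step that runs on the saved HTML. The real program uses BeautifulSoup; this version uses the standard library's `html.parser` so it runs without extra installs. The names `TitleLinkParser` and `extract_title_links` are hypothetical, and the assumption that each result carries an anchor whose `href` starts with `/title/tt<digits>/` may not match IMDb's markup exactly.

```python
import re
from html.parser import HTMLParser

# Assumption: result links look like /title/tt0111161/?ref_=...
TITLE_HREF = re.compile(r"^/title/(tt\d+)/")


class TitleLinkParser(HTMLParser):
    """Collect unique IMDb title IDs from anchor hrefs, in document order."""

    def __init__(self):
        super().__init__()
        self.title_ids = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = TITLE_HREF.match(href)
        # Each movie usually appears more than once (poster and title),
        # so keep only the first occurrence of each ID.
        if match and match.group(1) not in self._seen:
            self._seen.add(match.group(1))
            self.title_ids.append(match.group(1))


def extract_title_links(html):
    """Return absolute title URLs found in an HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    return [f"https://www.imdb.com/title/{tid}/" for tid in parser.title_ids]
```

It would be called on the contents of the saved file, for example `extract_title_links(open('source.html', encoding='utf-8').read())`.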

This is my Selenium code. Once again, if there are obvious newbie mistakes, I apologise.

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
options = Options()
options.binary_location = "/usr/bin/chromium"
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument(f'user-agent={user_agent}')
service = Service(executable_path='/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=options)
driver.set_window_size(1920, 1080)


url = "https://www.imdb.com/search/title/?title_type=feature"
driver.get(url)
def click_load_more():
    try:
        # Wait for the button to be clickable
        wait = WebDriverWait(driver, 800000)
        load_more_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".ipc-see-more__button")))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        load_more_button.click()
        time.sleep(5) # Wait for the page to load
    except Exception as e: 
        print(f"Error: {e}")
        with open('errorFile.txt', 'w', encoding='utf-8') as errorFile:
            errorFile.write(e)
            


print("Starting. Hang on tight!!")
for x in range(800):
    print(x)
    click_load_more()

source = driver.page_source
with open('source.html', 'w', encoding='utf-8') as file:
    file.write(source)

I am running it in headless mode because the server I'm running it on doesn't have a GUI, so I have no way of actually seeing what is going on.
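Because the script only writes `source.html` after the last click, a crash loses everything loaded so far. One possible way to keep partial results is to pull the links out of the page periodically and write them to disk. The sketch below is untested against IMDb: `checkpoint_links`, the checkpoint file name, and the CSS selector are all assumptions. It relies only on `driver.execute_script`, so it accepts any object with that method. It does not prevent the tab from crashing; it only preserves progress.

```python
import json

# JavaScript run inside the page: returns the href of every title link.
# Assumption: title links contain "/title/tt" in their href.
COLLECT_JS = """
return Array.from(document.querySelectorAll('a[href*="/title/tt"]'))
            .map(a => a.href);
"""


def checkpoint_links(driver, seen, path="links_checkpoint.json"):
    """Add links currently in the DOM to `seen` and save them to disk.

    Returns the number of links that were new in this call.
    """
    hrefs = driver.execute_script(COLLECT_JS) or []
    before = len(seen)
    # Drop query strings so the same movie is not counted twice.
    seen.update(h.split("?", 1)[0] for h in hrefs)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen), f)
    return len(seen) - before
```

In the loop above this could be called every few iterations, for example `if x % 10 == 0: checkpoint_links(driver, seen)` with `seen = set()` created before the loop. Scanning the whole DOM gets slower as the page grows, so calling it on every click is probably not worthwhile.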

Here is the error message in question.

error: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=119.0.6045.199)
Stacktrace:
#0 0x55e393f726f4 <unknown>
#1 0x55e393c771c7 <unknown>
#2 0x55e393c627f5 <unknown>
#3 0x55e393c615fa <unknown>
#4 0x55e393c61c0a <unknown>
#5 0x55e393c6f42b <unknown>
#6 0x55e393c825a5 <unknown>
#7 0x55e393cf60fa <unknown>
#8 0x55e393cdef13 <unknown>
#9 0x55e393cb1460 <unknown>
#10 0x55e393cb2ad3 <unknown>
#11 0x55e393f454e0 <unknown>
#12 0x55e393f487a0 <unknown>
#13 0x55e393f4821a <unknown>
#14 0x55e393f48cb5 <unknown>
#15 0x55e393f3767b <unknown>
#16 0x55e393f490a0 <unknown>
#17 0x55e393f22850 <unknown>
#18 0x55e393f62f07 <unknown>
#19 0x55e393f63115 <unknown>
#20 0x55e393f71cee <unknown>
#21 0x7f749ceaa9eb <unknown>

Traceback (most recent call last):
  File "/data/emo4858/Documents/projetFilm/autoclick.py", line 26, in click_load_more
    load_more_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".ipc-see-more__button")))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/support/wait.py", line 86, in until
    value = method(self._driver)
            ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/support/expected_conditions.py", line 354, in _predicate
    target = driver.find_element(*target)  # grab element at locator
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 738, in find_element
    return self.execute(Command.FIND_ELEMENT, {"using": by, "value": value})["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 344, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=119.0.6045.199)
Stacktrace:
#0 0x55e393f726f4 <unknown>
#1 0x55e393c771c7 <unknown>
#2 0x55e393c627f5 <unknown>
#3 0x55e393c615fa <unknown>
#4 0x55e393c61c0a <unknown>
#5 0x55e393c6f42b <unknown>
#6 0x55e393c825a5 <unknown>
#7 0x55e393cf60fa <unknown>
#8 0x55e393cdef13 <unknown>
#9 0x55e393cb1460 <unknown>
#10 0x55e393cb2ad3 <unknown>
#11 0x55e393f454e0 <unknown>
#12 0x55e393f487a0 <unknown>
#13 0x55e393f4821a <unknown>
#14 0x55e393f48cb5 <unknown>
#15 0x55e393f3767b <unknown>
#16 0x55e393f490a0 <unknown>
#17 0x55e393f22850 <unknown>
#18 0x55e393f62f07 <unknown>
#19 0x55e393f63115 <unknown>
#20 0x55e393f71cee <unknown>
#21 0x7f749ceaa9eb <unknown>


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/emo4858/Documents/projetFilm/autoclick.py", line 41, in <module>
    click_load_more()
  File "/data/emo4858/Documents/projetFilm/autoclick.py", line 34, in click_load_more
    errorFile.write(e)
TypeError: write() argument must be str, not WebDriverException
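
The second traceback is a separate problem from the page crash: `write()` only accepts `str`, and the `except` block passes it the exception object. A minimal sketch of an error logger that writes text instead follows. `log_error` is a hypothetical helper name, and it appends rather than overwrites, which differs from the `'w'` mode used in the script.

```python
import traceback


def log_error(exc, path="errorFile.txt"):
    """Append a readable record of `exc` to a text file."""
    # file.write() accepts only str, so format the exception first.
    text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    with open(path, "a", encoding="utf-8") as error_file:
        error_file.write(text + "\n")
```

Inside the `except Exception as e:` block, the `open`/`write` pair would then become a single `log_error(e)` call. Writing `str(e)` would also avoid the `TypeError`, but it drops the Python traceback.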

Upvotes: 1

Views: 254

Answers (0)
