mht

Reputation: 133

Crawling data with Selenium throws a TimeoutException

I am trying to crawl the reviews on some websites. For a single website it runs fine; however, when I loop over many websites, it raises:

TimeoutException(message, screen, stacktrace) TimeoutException

I have already increased the wait time from 30 to 50 seconds, but it still fails. Here is my code:

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()

result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist() 

link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="

for i in ids:
    
    link3 = link + str(i[-17:].replace("/",""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe',options=op)
    driver.get(i)
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()

    soup = BeautifulSoup(driver.page_source, 'lxml')

    product_attributes = requests.get(link3).json()

    reviewtitle = [i.get_text() for i in soup.find_all("strong", class_="review__title") ]

    url = [i]*len(reviewtitle)

    productid = [product_attributes["dmp"]["productId"]]*len(reviewtitle)
  
    content= [i.get_text().strip()  for i in soup.find_all("div",attrs={"class":"review__body"})]
    
    author = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-name"})]

    date  = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-date"})]

    output = pd.DataFrame(list(zip(url, productid, reviewtitle, author, content, date)))

    result = result.append(output)  # DataFrame.append returns a new frame; reassign it

    result.to_excel(r'D:\bols.xlsx', index=False)

    driver.quit()  # close this Chrome instance before the next iteration
    
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Here are some links that I tried to crawl:

link1 link2

Upvotes: 0

Views: 361

Answers (2)

KunduK

Reputation: 33384

I would suggest using an infinite while loop with a try..except block: if the element is found, it is clicked; otherwise control falls into the except block and breaks out of the loop.

from selenium.common.exceptions import TimeoutException

driver.get("https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/")
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
while True:
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except TimeoutException:
        print("No more load more button available on the page. Exiting...")
        break

Your console output will look like this:

Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
No more load more button available on the page. Exiting...

Upvotes: 1

RichEdwards

Reputation: 3753

As mentioned in the comments - you're timing out because you're looking for a button that does not exist.

You need to catch the error(s) and skip the failing iterations. You can do this with a try and except.

I've put together an example for you. It's hard-coded to one url (as I don't have your data sheet), and it uses a fixed loop whose purpose is to keep TRYING to click the "show more" button, even after it's gone.

With this solution, be careful of your sync time: EACH TIME WebDriverWait is called, it waits the full duration if the element does not exist. You'll need to exit the expand loop when done (the first time you trip the error) and keep your sync time tight - or it will be a slow script.

First, add these to your imports:

from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

Then this will run and not error:

#note: a fixed url for this example:
driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')

#accept the cookie once
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
   
for i in range(10):
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        print("No more to load - but i didn't fail")

The output to the console is this:

DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f
I pressed load more
I pressed load more
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail (and so on).

This is how my browser looks - note the size of the scroll bar for the link I used - it looks like it's got all the reviews.
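Since each 50-second WebDriverWait is paid in full once the button disappears, the exit-on-first-error idea described above can be sketched without a browser. Everything below is illustrative: `TimeoutError` stands in for Selenium's `TimeoutException`, and `fake_click` is a hypothetical stand-in for the `WebDriverWait(driver, 5).until(...).click()` call (note a SHORT timeout, so the one unavoidable failed wait costs 5 s, not 50 s):

```python
def expand_all(try_click, max_clicks=100):
    """Click 'load more' until the first timeout, then stop."""
    clicks = 0
    for _ in range(max_clicks):
        try:
            try_click()
            clicks += 1
        except TimeoutError:
            break  # button is gone: stop instead of paying for more waits
    return clicks

# stub "page" that offers exactly three load-more clicks
state = {"remaining": 3}
def fake_click():
    if state["remaining"] == 0:
        raise TimeoutError("button not clickable")
    state["remaining"] -= 1

print(expand_all(fake_click))  # prints 3
```

With the real driver, `try_click` would be a small lambda around the WebDriverWait call, and `except TimeoutError` would become `except (TimeoutException, StaleElementReferenceException)`.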

Upvotes: 1
