Reputation: 133
I am trying to crawl reviews from several websites. For one website it runs fine; however, when I loop over many websites, it throws an error:
raise TimeoutException(message, screen, stacktrace) TimeoutException
I tried increasing the wait time from 30 to 50, but it still fails. Here is my code:
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()
result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist()
link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="

for i in ids:
    # the product id is the last 17 characters of the url
    link3 = link + str(i[-17:].replace("/", ""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe', options=op)
    driver.get(i)
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
    soup = BeautifulSoup(driver.page_source, 'lxml')
    product_attributes = requests.get(link3).json()
    reviewtitle = [t.get_text() for t in soup.find_all("strong", class_="review__title")]
    url = [i] * len(reviewtitle)
    productid = [product_attributes["dmp"]["productId"]] * len(reviewtitle)
    content = [t.get_text().strip() for t in soup.find_all("div", attrs={"class": "review__body"})]
    author = [t.get_text() for t in soup.find_all("li", attrs={"data-test": "review-author-name"})]
    date = [t.get_text() for t in soup.find_all("li", attrs={"data-test": "review-author-date"})]
    output = pd.DataFrame(list(zip(url, productid, reviewtitle, author, content, date)))
    result = result.append(output)  # append returns a new frame; reassign or the rows are lost
    driver.quit()  # close this browser before starting the next one

result.to_excel(r'D:\bols.xlsx', index=False)
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
Here are some links that I tried to crawl:
Upvotes: 0
Views: 361
Reputation: 33384
I would suggest using an infinite while loop with a try..except block. If the element is found, it is clicked; otherwise execution falls into the except block and exits the while loop.
driver.get("https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/")
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
while True:
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except:
        print("No more load more button available on the page. Please exit...")
        break
Your console output will look like this:
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
No more load more button available on the page. Please exit...
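A slightly tightened variant of the same loop is sketched below; this is my own refinement, not part of the answer above. It catches TimeoutException explicitly instead of using a bare except, and the 5-second wait is an assumed value chosen so the final, failing check does not stall for the full 50 seconds:

from selenium.common.exceptions import TimeoutException

while True:
    try:
        # short wait: the last (failing) check only costs 5 seconds instead of 50
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except TimeoutException:
        print("No more load more button available on the page. Exiting...")
        break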
Upvotes: 1
Reputation: 3753
As mentioned in the comments, you're timing out because you're looking for a button that does not exist.
You need to catch the error(s) and skip the failing lines. You can do this with a try and except.
I've put together an example for you. It's hard-coded to one url (as I don't have your data sheet), and it uses a fixed loop that deliberately keeps TRYING to click the "show more" button, even after it's gone.
With this solution, be careful of your sync time: EACH TIME the WebDriverWait
is called, it will wait the full duration if the element does not exist. You'll want to exit the expand loop as soon as you're done (the first time you trip the error) and keep your sync time tight, or it will be a slow script (see the sketch at the end of this answer).
First, add these to your imports:
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
Then this will run and not error:
# note: a fixed url for this example
driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')
# accept the cookie once
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
for i in range(10):
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        print("No more to load - but i didn't fail")
The output to the console is this:
DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f
I pressed load more
I pressed load more
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail (and so on).
This is how my browser looks (screenshot). Note the size of the scroll bar for the link I used - it looks like it has loaded all the reviews.
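To illustrate the sync-time point, here is a minimal sketch of the exit-on-first-error idea; the 5-second timeout is my assumption, not a value from the answer. It breaks out of the expand loop on the first timeout instead of waiting the full duration on every remaining iteration:

for _ in range(10):
    try:
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        # the first failure means the button is gone - stop expanding straight away
        break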
Upvotes: 1