Reputation: 53
I have a client who wants to web scrape this sketchy website. The loop works the first time through, then the error below occurs. Any help? I suggest not visiting the website, but hopefully the pay's worth my time lol.
options = webdriver.ChromeOptions()
options.add_argument("--incognito")
PATH = 'C:\Program Files (x86)\chromedriver.exe'
URL = 'https://avbebe.com/archives/category/高清中字/page/5'
driver = webdriver.Chrome(executable_path=PATH, options=options)
driver.get(URL)
time.sleep(5)
Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    actions = ActionChains(driver)
    time.sleep(5)
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))  # where error occurs
    actions.double_click(title).perform()
    time.sleep(5)
    VidUrl = driver.current_url
    VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text
    try:
        VidTags = driver.find_elements_by_class_name('tags')
        for tag in VidTags:
            VidTag = tag.find_element_by_tag_name('a').text
    except NoSuchElementException or StaleElementReferenceException:
        pass
    with open('data.csv', 'w', newline='', encoding="utf-8") as f:
        fieldnames = ['Title', 'Tags', 'URL']
        thewriter = csv.DictWriter(f, fieldnames=fieldnames)
        thewriter.writeheader()
        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})
    driver.back()
    driver.refresh()
print('done')
Error:
WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))
  File "C:\Users\Heage\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Upvotes: 1
Views: 266
Reputation: 854
You are nearly there, just missing a few pieces.
Firstly, you are fetching all the title elements and then navigating away inside a loop:
Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    # ...
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))
    # ...
    driver.back()
    driver.refresh()
What happens is that once the browser navigates to a different URL, all of those elements become stale: the browser no longer has a connection to the original DOM nodes, so any attempt to interact with them throws an error.
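As a minimal sketch of the failure mode (example.com and example.org are stand-in URLs, not from your script):

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome()
driver.get('https://example.com')
link = driver.find_element_by_tag_name('a')  # a reference into the current DOM
driver.get('https://example.org')            # navigation throws that DOM away

try:
    link.click()  # the stored reference no longer points at a live element
except StaleElementReferenceException:
    print('element went stale after navigation')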
So what you need to do is read all the available links up front and then navigate to each one with driver.get, without any need to go back and refresh the page:
link_elements = driver.find_elements_by_css_selector('.entry-title a')
links = {link_element.get_attribute('href') for link_element in link_elements}

for link in links:
    driver.get(link)  # otherwise, stale elements
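As a side note, collecting the href values into a set also de-duplicates repeated links, and because the set holds plain strings rather than element references, nothing in it can go stale.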
Next, once you open the page, you are searching for an element with an id.
VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text
However, you have to keep in mind that ids like this change from page to page, so your script is likely to fail here.
Instead, try to find classes that don't change. I took a look at the page and found that the video title sits in an h1 tag with the entry-title class, so I used that instead:
VidTitle = driver.find_element_by_css_selector('h1.entry-title').text
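If the page renders slowly, this lookup can also be wrapped in an explicit wait rather than a fixed sleep; the 10-second timeout below is an arbitrary choice, not something from the original code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 s for the title to be present instead of sleeping blindly
VidTitle = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.entry-title'))
).text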
Working solution
import csv

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--incognito")
driver = webdriver.Chrome(options=options)

URL = 'https://avbebe.com/archives/category/高清中字/page/5'
driver.get(URL)

# collect the links first, so later navigation can't invalidate anything
link_elements = driver.find_elements_by_css_selector('.entry-title a')
links = {link_element.get_attribute('href') for link_element in link_elements}

# open the file once; reopening it in 'w' mode inside the loop would truncate it each time
with open('data.csv', 'w', newline='', encoding="utf-8") as f:
    fieldnames = ['Title', 'Tags', 'URL']
    thewriter = csv.DictWriter(f, fieldnames=fieldnames)
    thewriter.writeheader()

    for link in links:
        driver.get(link)
        VidUrl = driver.current_url
        VidTitle = driver.find_element_by_css_selector('h1.entry-title').text
        VidTag = ''  # default in case the page has no tags
        try:
            VidTags = driver.find_elements_by_class_name('tags')
            for tag in VidTags:
                VidTag = tag.find_element_by_tag_name('a').text
        except (NoSuchElementException, StaleElementReferenceException):
            pass
        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})

print('done')
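Note two small fixes folded in above: except NoSuchElementException or StaleElementReferenceException evaluates the or first and only ever catches NoSuchElementException, so the exception classes belong in a tuple instead, and the CSV file is opened once before the loop, because reopening it in 'w' mode on every iteration truncates it and leaves only the last row.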
Upvotes: 3
Reputation: 3790
Put the line driver.get(URL) inside the loop, and remove driver.back() and driver.refresh().
options = webdriver.ChromeOptions()
options.add_argument("--incognito")
PATH = 'C:\Program Files (x86)\chromedriver.exe'
URL = 'https://avbebe.com/archives/category/高清中字/page/5'
driver = webdriver.Chrome(executable_path=PATH, options=options)
driver.get(URL)
time.sleep(5)
Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    driver.get(URL)
    actions = ActionChains(driver)
    time.sleep(5)
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))  # where error occurs
    actions.double_click(title).perform()
    time.sleep(5)
    VidUrl = driver.current_url
    VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text
    try:
        VidTags = driver.find_elements_by_class_name('tags')
        for tag in VidTags:
            VidTag = tag.find_element_by_tag_name('a').text
    except NoSuchElementException or StaleElementReferenceException:
        pass
    with open('data.csv', 'w', newline='', encoding="utf-8") as f:
        fieldnames = ['Title', 'Tags', 'URL']
        thewriter = csv.DictWriter(f, fieldnames=fieldnames)
        thewriter.writeheader()
        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})
    # driver.back()
    # driver.refresh()
print('done')
Upvotes: 0