Reputation: 315
Can anybody help me with this? I have written code to scrape articles from a Chinese news site using Selenium. As many of the URLs do not load, I tried to include code to catch timeout exceptions, which works, but then the browser seems to stay on the page which timed out rather than moving on to try the next URL.
I've tried adding driver.quit() and driver.close() after handling the error, but then the driver is no longer usable when the loop continues to the next URL.
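To show what I mean (a simplified sketch, not my actual script; the webdriver setup is a placeholder): once driver.quit() runs in the except block, the next driver.get() fails because the browser session no longer exists.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()  # placeholder; my real driver setup happens elsewhere

for url in url_list:
    try:
        driver.get(url)
    except TimeoutException:
        print("Page load time out")
        driver.quit()   # what I tried after the timeout
        continue        # but the next driver.get() then fails: the session is gone

My full code is below.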
with open('url_list_XB.txt', 'r') as f:
    url_list = f.readlines()

for idx, url in enumerate(url_list):
    status = str(idx) + " " + str(url)
    print(status)
    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/", "").replace(".html", "")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                        dt = date.split(" 来源")[0]
                        dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")
                        count = driver.find_element_by_class_name("post_tie_top").text
                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)
                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")
                node = [url, tb_link]
                results.append(node)
        except NoSuchElementException as exception:
            print("TB link not found ")
            continue
    except TimeoutException as ex:
        print("Page load time out")
    except WebDriverException:
        print('WD Exception')
I want the code to move through a list of URLs, calling each one and grabbing the article text as well as a link to the discussion page. It works until a page times out while loading; after that, the programme will not move on.
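In case it is relevant, the TimeoutException I am catching comes from the page-load limit set on the driver, roughly like this (the 30-second value is just an example, my actual limit may differ):

driver.set_page_load_timeout(30)  # any page that takes longer than this raises TimeoutException from driver.get()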
Upvotes: 1
Views: 4485
Reputation: 5909
I can't exactly understand what your code is doing because I have no context for the page you are automating, but I can provide a general structure for how you would accomplish something like this. Here's a simplified version of how I would handle your scenario:
# iterate URL list
for url in url_list:
    # navigate to a URL
    driver.get(url)
    # check something here to test if a link is 'broken' or not
    try:
        driver.find_element(someLocator)
    # if link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we can return to beginning of loop
        continue
    # if you reach this point, the link is valid, and you can 'do stuff' on the page
This code navigates to the URL and performs some check (that you specify) to test whether the link is 'broken' or not. We detect a broken link by catching the TimeoutException that gets thrown. If the exception is thrown, we navigate back to the previous page, then call continue to return to the beginning of the loop and start over with the next URL.
If we make it through the try / except block, then the URL is valid and we are on the correct page. At that point, you can write your code to scrape the articles or whatever else you need to do. The code that appears after the try / except block will ONLY be reached if TimeoutException is NOT raised, meaning the URL is valid.
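If the check you plug into the try block is an element lookup, one common way to have it raise a TimeoutException is an explicit wait. Here's a sketch of the same loop using WebDriverWait (the locator and the 10-second timeout are just examples; post_text is taken from your code):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

for url in url_list:
    driver.get(url)
    try:
        # wait up to 10 seconds for an element that only exists on a valid article page
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post_text"))
        )
    except TimeoutException:
        # broken page: go back and move on to the next URL
        driver.back()
        continue
    # valid page: scrape the article here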
Upvotes: 2