Nick Olczak

Reputation: 315

Handling timeout with Selenium and Python

Can anybody help me with this? I have written code to scrape articles from a Chinese news site using Selenium. Since many of the URLs do not load, I included code to catch timeout exceptions. This works, but the browser then seems to stay on the page that timed out while loading, rather than moving on to try the next URL.

I've tried adding driver.quit() and driver.close() after handling the error, but then the code fails when it continues to the next iteration of the loop.

import os
import re

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException

# webdriver setup -- the page load timeout is what raises TimeoutException in the loop
driver = webdriver.Chrome()
driver.set_page_load_timeout(30)

results = []

with open('url_list_XB.txt', 'r') as f:
    url_list = f.readlines()

for idx, url in enumerate(url_list):
    url = url.strip()  # readlines() keeps the trailing newline
    status = str(idx) + " " + url
    print(status)

    try:
        driver.get(url)
        try:
            # find the link(s) to the comment page on the article page
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/", "").replace(".html", "")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        # grab the article text, headline, date and comment count
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                            dt = date.split(" 来源")[0]
                            dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException:
                        print("Element not found")
                except IndexError:
                    print("Index Error")

            node = [url, tb_link]
            results.append(node)

        except NoSuchElementException:
            print("TB link not found")
        continue

    except TimeoutException:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

I want the code to move through a list of URLs, requesting each one and grabbing the article text as well as a link to the discussion page. It works until a page times out while loading; after that, the programme will not move on.
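For reference, the driver.quit() attempt mentioned above looked roughly like this (a simplified sketch, using the same driver and url_list as in the code above), which is why the next iteration fails: after quit() the browser session is gone, so the following driver.get() raises an error unless a new browser is started.

from selenium.common.exceptions import TimeoutException

for url in url_list:
    try:
        driver.get(url)
    except TimeoutException:
        print("Page load time out")
        driver.quit()  # ends the session -- the next driver.get() then fails
        continue
    # ... scrape the article and comment link here ...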

Upvotes: 1

Views: 4485

Answers (1)

CEH

Reputation: 5909

I can't tell exactly what your code is doing because I have no context for the page you are automating, but I can provide a general structure for how you would accomplish something like this. Here's a simplified version of how I would handle your scenario:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# iterate URL list
for url in url_list:

    # navigate to a URL
    driver.get(url)

    # check something here to test if a link is 'broken' or not
    try:
        # wait for an element that only a working page contains;
        # someLocator is a placeholder, e.g. (By.CLASS_NAME, "post_text")
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(someLocator))

    # if the link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we can return to the beginning of the loop
        continue

    # if you reach this point, the link is valid, and you can 'do stuff' on the page

This code navigates to the URL and performs some check (that you specify) to see if the link is 'broken' or not. We check for a broken link by catching the TimeoutException that the explicit wait throws when the expected element never appears. If the exception is thrown, we navigate back to the previous page, then call continue to return to the beginning of the loop and start over with the next URL.

If we make it through the try / except block, then the URL is valid and we are on the correct page. At that point, you can write your code to scrape the articles or whatever else you need to do.

The code that appears after the try / except will ONLY be hit if a TimeoutException is NOT encountered -- meaning the URL is valid.
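In your specific case, the TimeoutException is raised by driver.get() itself (via the page load timeout), so one way to adapt the pattern is to put the navigation inside the try as well and simply continue to the next URL. A rough sketch, using the post_topshare_wrap element from your own code as the "valid page" check (the 30- and 10-second timeouts are arbitrary values, not requirements):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.set_page_load_timeout(30)  # makes driver.get() raise TimeoutException

for url in url_list:
    try:
        # both the navigation and the element check can time out
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post_topshare_wrap")))
    except TimeoutException:
        print("Page load time out")
        continue  # skip this URL and move on to the next one

    # page loaded and the expected element is present -- scrape it here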

Upvotes: 2
