Simon Lindgren

Reputation: 2031

Overriding HTTP errors with urllib2

I have this code, but it is not working. I want to use urllib2 to iterate through a list of urls. Upon opening each url, BeautifulSoup locates a class and extracts that text. The program stalls if there is an invalid url in the list. If there is any error, I just want to have 'error' as the text, and for the program to continue on to the next url. Any ideas?

    for url in url_list:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())

        text = soup.find_all(class_='ProfileHeaderCard-locationText u-dir')
        if text is not None:
            for t in text:
                text2 = t.get_text().encode('utf-8')
        else:
            text2 = 'error'

Upvotes: 0

Views: 55

Answers (2)

Alex Martelli

Reputation: 881575

try/except is your friend! Change your code to something like...:

    for url in url_list:
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError:
            text2 = 'error'
        else:
            soup = BeautifulSoup(page.read())
            text = soup.find_all(class_='ProfileHeaderCard-locationText u-dir')
            if text:
                for t in text:
                    text2 = t.get_text().encode('utf-8')
            else:
                text2 = 'error'

Upvotes: 3

vkorchagin

Reputation: 656

urllib2.urlopen raises a URLError on failure, as you can find in the docs.

Use a try/except block:

    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError as e:
        print e
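
For context, here is a minimal sketch of how that try/except might slot into the original loop, assuming Python 2 with urllib2 and BeautifulSoup from bs4 installed; url_list is a hypothetical placeholder for your own list of URLs:

    import urllib2
    from bs4 import BeautifulSoup

    url_list = ['http://example.com/profile']  # hypothetical placeholder list of URLs

    for url in url_list:
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError as e:
            print e            # log what went wrong
            text2 = 'error'    # fall back to 'error' and move on to the next URL
            continue

        soup = BeautifulSoup(page.read())
        text = soup.find_all(class_='ProfileHeaderCard-locationText u-dir')
        if text:
            for t in text:
                text2 = t.get_text().encode('utf-8')
        else:
            text2 = 'error'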

Upvotes: 3
