thatandrey

Reputation: 287

Python web scraping, skip url if error

I'm trying to scrape one site (about 7000 links, all in a list), and because of my method it is taking a LONG time. I guess I'm OK with that (since that implies staying undetected), but if I get any kind of error while trying to retrieve a page, can I just skip it? Right now, if there's an error, the code breaks and gives me a bunch of error messages. Here's my code:

Collection is a list of lists and the resultant file. Basically, I'm trying to run a loop with get_url_data() (which I have a previous question to thank for) over all the URLs in urllist. I catch HTTPError, but that doesn't seem to handle all the errors, hence this post. As a related side-quest, it would also be nice to get a list of the URLs that couldn't be processed (see the sketch after the code), but that's not my main concern.

import requests
import bs4
from requests.exceptions import HTTPError

Collection = []  # list of lists; the resultant file

def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:  # only catches HTTP status errors, not timeouts etc.
        return None

    site = bs4.BeautifulSoup(r.text, 'html.parser')
    groups = site.select('div.filters')
    word = url.split("/")[-1]  # the word is the last path segment of the url

    B = []
    for x in groups:
        B.append(word)
        T = [a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1 = [a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T) == 1 and len(A1) > 0 and T[0] == 'verb' and A1[0] != 'as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
            Collection.append(B)
        B = []

for url in urllist:
    get_url_data(url)
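
For the side-quest, a minimal sketch of one way to record the URLs that could not be processed. It assumes get_url_data() is changed to end with return True on success, so that a None result signals a failure; that change and the failed_urls list are illustrations, not part of the original code.

failed_urls = []

for url in urllist:
    # Sketch only: assumes get_url_data() returns True on success,
    # so None means the page could not be retrieved.
    if get_url_data(url) is None:
        failed_urls.append(url)

print(len(failed_urls), "urls could not be processed:")
print(failed_urls)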

I think this was the main error, which then triggered the others, because there were a bunch of errors starting with "During handling of the above exception, another exception occurred":

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

Upvotes: 0

Views: 5996

Answers (1)

salmanwahed

Reputation: 9647

You can make your try/except block look like this:

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()

except Exception:
    return

The Exception class is the base class of (nearly) all built-in exceptions, so this will catch any error raised while fetching the page, not just HTTPError.

If you want to see the error message, bind the exception to a name with as and print it in your except block:

except Exception as e:
    print(e)  # e.message does not exist in Python 3; print the exception itself
    return
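
If a bare Exception feels too broad (it will also hide genuine bugs in the parsing code), a narrower sketch is to catch only the errors raised by requests itself; requests.exceptions.RequestException is the base class of HTTPError, ConnectionError, Timeout, and the rest:

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.exceptions.RequestException as e:
    # Covers HTTPError, ConnectionError, Timeout, etc., while letting
    # unrelated errors in your own code surface normally.
    print(e)
    return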

Upvotes: 5
