Reputation: 287
I'm trying to scrape one site (about 7,000 links, all in a list), and because of my method it is taking a LONG time. I guess I'm OK with that (since it implies staying undetected), but if I get any kind of error while trying to retrieve a page, can I just skip it? Right now, if there's an error, the code breaks and gives me a bunch of error messages. Here's my code:
Collection is a list of lists and the resultant file. Basically, I'm trying to run a loop with get_url_data() (which I have a previous question to thank for) over all the URLs in urllist. I have something called HTTPError, but that doesn't seem to handle all the errors, hence this post. As a related side-quest, it would also be nice to get a list of the URLs that couldn't be processed, but that's not my main concern (though it would be cool if someone could show me how).
import requests
import bs4
from requests.exceptions import HTTPError

Collection = []

def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    except HTTPError:
        return None
    site = bs4.BeautifulSoup(r.text)
    groups = site.select('div.filters')
    word = url.split("/")[-1]  # last path segment of the URL
    B = []
    for x in groups:
        B.append(word)
        T = [a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1 = [a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T) == 1 and len(A1) > 0 and T[0] == 'verb' and A1[0] != 'as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
            Collection.append(B)
            B = []

for url in urllist:
    get_url_data(url)
I think the main error was this one, which triggered the others, because there were a bunch of errors starting with During handling of the above exception, another exception occurred.
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
Upvotes: 0
Views: 5996
Reputation: 9647
You can make your try-except block look like this:
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except Exception:
    return
The Exception class is the base class of all the errors and exceptions you are seeing, so catching it will handle all of them.
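If you would rather skip only failures that come from the request itself (connection errors, timeouts, bad HTTP status codes) instead of also swallowing every possible bug in the parsing code, a narrower option is to catch requests.exceptions.RequestException, the base class for all exceptions that requests raises:

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raises HTTPError (a RequestException subclass) on 4xx/5xx
except requests.exceptions.RequestException:
    return None  # skip this URL on any network or HTTP failure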
If you want to see the exception message, you can print it in your except block. You have to bind the exception to a name with as first:
except Exception as e:
    print(e)  # str(e) gives the message; e.message does not exist in Python 3
    return
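As for the side-quest of collecting the URLs that couldn't be processed: a minimal sketch, assuming you also change get_url_data so the except block does return False and the function ends with return True, would be

failed_urls = []                   # URLs that hit an error
for url in urllist:
    if not get_url_data(url):      # assumes get_url_data now returns True/False
        failed_urls.append(url)

print(len(failed_urls), "URLs failed:")
print(failed_urls)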
Upvotes: 5