Reputation: 404
I have made a web crawler that takes thousands of Urls from a text file and then crawls the data on that webpage.
Now that it has many Urls; some Urls are broken too.
So it gives me the error:
Traceback (most recent call last):
File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 57, in <module>
crawl_data("http://www.foasdasdasdasdodily.com/r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show")
File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 18, in crawl_data
data = requests.get(url)
File "C:\Python27\lib\site-packages\requests\api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\site-packages\requests\adapters.py", line 437, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.foasdasdasdasdodily.com', port=80): Max retries exceeded with url: /r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0310FCB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
Here's my code:
def crawl_data(url):
global connectString
data = requests.get(url)
response = str( data )
if response != "<Response [200]>":
return
soup = BeautifulSoup(data.text,"lxml")
titledb = soup.h1.string
But it still gives me the same exception or error.
I simply want it to ignore that Urls from which there is no response and move on to the next Url.
Upvotes: 1
Views: 2890
Reputation: 37113
You need to learn about exception handling. The easiest way to ignore these errors is to surround the code that processes a single URL with a try-except
construct, making you code read something like:
try:
<process a single URL>
except requests.exceptions.ConnectionError:
pass
This will mean that if the specified exception occurs your program will just execute the pass
(do nothing) statement and move on to the next
Upvotes: 3
Reputation: 26748
Use try-except
:
def crawl_data(url):
global connectString
try:
data = requests.get(url)
except requests.exceptions.ConnectionError:
return
response = str( data )
soup = BeautifulSoup(data.text,"lxml")
titledb = soup.h1.string
Upvotes: 2