Shivam

Reputation: 21

Unable to scrape data using requests and bs4

I have written a script that pulls data from an e-commerce website. It uses requests to fetch the page and bs4 to parse its contents. Everything works fine when I run the script locally on my machine: it takes 3-4 seconds to list the data, but it works. The problem started when I deployed the script on Heroku. There the script still works, but it is a little slower and, most annoyingly, it crashes very frequently. It scrapes the data about 6-7 times and then throws a long traceback. Being a beginner, I can't make anything out of it. Here is the full traceback from the Heroku logs:

2020-09-11T18:39:48.896959+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.897027+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
2020-09-11T18:39:48.897328+00:00 app[worker.1]: conn = connection.create_connection(
2020-09-11T18:39:48.897333+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
2020-09-11T18:39:48.897547+00:00 app[worker.1]: raise err
2020-09-11T18:39:48.897569+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
2020-09-11T18:39:48.897793+00:00 app[worker.1]: sock.connect(sa)
2020-09-11T18:39:48.897834+00:00 app[worker.1]: OSError: [Errno 113] No route to host
2020-09-11T18:39:48.897835+00:00 app[worker.1]: 
2020-09-11T18:39:48.897891+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.897892+00:00 app[worker.1]: 
2020-09-11T18:39:48.897898+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.897898+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
2020-09-11T18:39:48.898299+00:00 app[worker.1]: httplib_response = self._make_request(
2020-09-11T18:39:48.898322+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 381, in _make_request
2020-09-11T18:39:48.898652+00:00 app[worker.1]: self._validate_conn(conn)
2020-09-11T18:39:48.898672+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
2020-09-11T18:39:48.899235+00:00 app[worker.1]: conn.connect()
2020-09-11T18:39:48.899238+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 309, in connect
2020-09-11T18:39:48.899483+00:00 app[worker.1]: conn = self._new_conn()
2020-09-11T18:39:48.899488+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
2020-09-11T18:39:48.899630+00:00 app[worker.1]: raise NewConnectionError(
2020-09-11T18:39:48.899656+00:00 app[worker.1]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host
2020-09-11T18:39:48.899658+00:00 app[worker.1]: 
2020-09-11T18:39:48.899658+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.899659+00:00 app[worker.1]: 
2020-09-11T18:39:48.899661+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.899678+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
2020-09-11T18:39:48.899896+00:00 app[worker.1]: resp = conn.urlopen(
2020-09-11T18:39:48.899899+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
2020-09-11T18:39:48.900165+00:00 app[worker.1]: retries = retries.increment(
2020-09-11T18:39:48.900180+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
2020-09-11T18:39:48.900369+00:00 app[worker.1]: raise MaxRetryError(_pool, url, error or ResponseError(cause))
2020-09-11T18:39:48.900409+00:00 app[worker.1]: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.flipkart.com', port=443): Max retries exceeded with url: /search?q=shoes&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host'))
2020-09-11T18:39:48.900411+00:00 app[worker.1]: 
2020-09-11T18:39:48.900411+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2020-09-11T18:39:48.900412+00:00 app[worker.1]: 
2020-09-11T18:39:48.900412+00:00 app[worker.1]: Traceback (most recent call last):
2020-09-11T18:39:48.900414+00:00 app[worker.1]: File "server.py", line 103, in <module>
2020-09-11T18:39:48.900542+00:00 app[worker.1]: reply= bot.flipkart(product= message_type)
2020-09-11T18:39:48.900567+00:00 app[worker.1]: File "/app/bot.py", line 86, in flipkart
2020-09-11T18:39:48.900823+00:00 app[worker.1]: datas= Test.scrape(product)
2020-09-11T18:39:48.900828+00:00 app[worker.1]: File "/app/Test.py", line 7, in __init__
2020-09-11T18:39:48.901017+00:00 app[worker.1]: self.source= requests.get('https://www.flipkart.com/search?q={}&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'.format(search_query)).content
2020-09-11T18:39:48.901049+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/api.py", line 76, in get
2020-09-11T18:39:48.901257+00:00 app[worker.1]: return request('get', url, params=params, **kwargs)
2020-09-11T18:39:48.901262+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/api.py", line 61, in request
2020-09-11T18:39:48.901466+00:00 app[worker.1]: return session.request(method=method, url=url, **kwargs)
2020-09-11T18:39:48.901471+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
2020-09-11T18:39:48.901887+00:00 app[worker.1]: resp = self.send(prep, **send_kwargs)
2020-09-11T18:39:48.901891+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
2020-09-11T18:39:48.902410+00:00 app[worker.1]: r = adapter.send(request, **kwargs)
2020-09-11T18:39:48.902413+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
2020-09-11T18:39:48.902823+00:00 app[worker.1]: raise ConnectionError(e, request=request)
2020-09-11T18:39:48.902882+00:00 app[worker.1]: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.flipkart.com', port=443): Max retries exceeded with url: /search?q=shoes&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd5906c0250>: Failed to establish a new connection: [Errno 113] No route to host'))
2020-09-11T18:39:48.991351+00:00 heroku[worker.1]: Process exited with status 1
2020-09-11T18:39:49.047690+00:00 heroku[worker.1]: State changed from up to crashed

I apologize for not sharing the whole code. The script is split across two or three linked files, so it isn't practical to post all of it here. I have tried hard but am unable to understand the error, so any help would be much appreciated!

Upvotes: 0

Views: 197

Answers (1)

joel.t.mathew

Reputation: 124

The error you showed ([Errno 113] No route to host) is caused by a missing or slow internet connection. Check that the internet connection is working properly. If that doesn't help, restart your current Python environment.
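Since the traceback shows the worker exiting as soon as `requests.get` raises `ConnectionError`, one way to keep it running through a temporary network failure is to catch the exception and retry. The following is only a minimal sketch, not the asker's actual code: the helper name `fetch_with_retries` and its parameters (`attempts`, `backoff`, `timeout`, `get`, `sleep`) are made up for illustration, and the retry counts and delays are arbitrary.

```python
import time

import requests


def fetch_with_retries(url, attempts=3, backoff=2.0, timeout=10,
                       get=requests.get, sleep=time.sleep):
    """Return the response body as bytes, or None if every attempt fails.

    `get` and `sleep` are hypothetical injection points so the helper can be
    exercised without real network access or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            # A timeout keeps a stalled connection from hanging the worker.
            response = get(url, timeout=timeout)
            response.raise_for_status()
            return response.content
        except requests.exceptions.RequestException as exc:
            # ConnectionError, Timeout and HTTPError all derive from
            # RequestException, so one handler covers them.
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(backoff * attempt)
    return None
```

In `Test.py` the `requests.get(...).content` call could then be replaced with something like `self.source = fetch_with_retries(url)`, with the caller checking for `None` and replying with an error message instead of letting the process crash. This does not fix the underlying connectivity problem; it only stops a single failed request from taking the worker down.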

Upvotes: 1
