Reputation: 170
I am trying to make a script that scrape presentation from a slideshare link and download it as a PDF.
The script is working fine, until the total slides are under 20. Is there any alternative to requests
in python that can do the job.
Here is the scripts:
import requests
from bs4 import BeautifulSoup
from PIL import Image
import io
URL_LESS = "https://www.slideshare.net/angelucmex/global-warming-2373190?qid=8f04572c-48df-4f53-b2b0-0eb71021931c&v=&b=&from_search=1"
URL="https://www.slideshare.net/tusharpanda88/python-basics-59573634?qid=03cb80ee-36f0-4241-a516-454ad64808a8&v=&b=&from_search=5"
r = requests.get(URL_LESS)
soup = BeautifulSoup(r.content, "html5lib")
imgs = soup.find_all('img', class_="slide-image")
imgSRC = [x.get("srcset").split(',')[0].strip().split(' ')[0].split('?')[0] for x in imgs]
imagesJPG = []
for img in imgSRC:
im = requests.get(img)
f = io.BytesIO(im.content)
imgJPG = Image.open(f)
imagesJPG.append(imgJPG)
imagesJPG[0].save(f"{soup.title.string}.pdf",save_all=True, append_images=imagesJPG[1:])
Try changing URL_LESS to URL, you will get the idea.
Here is the traceback
Traceback (most recent call last):
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 95, in create_connection
raise err
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 1040, in _validate_conn
conn.connect()
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 358, in connect
conn = self._new_conn()
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 440, in send
resp = conn.urlopen(
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\Work\py\scrapingScripts\slideshare\main.py", line 16, in <module>
im = requests.get(img)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did
not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Upvotes: 1
Views: 609
Reputation: 52
The script worked perfectly for me both when using URL and URL_LESS, so your internet might be the culprit here.
My guesses are:
You're having a slow/inconsistent internet.
Slideshare is blacklisting your IP/ web-agent maybe for DDOS protection.(unlikely)
You're Using ipv6, which has been the culprit in these kind of cases for me, try switching your network to use ipv4 only.
and when it comes to requests, I have personally used it to scrape a fairly large amount of data for a fairly long time so I can say it's an amazing library to use
Upvotes: 2