Reputation: 388
I'm trying to download a file from within Python. I've tried both urllib and requests, and both give me a timeout error. The file is at: http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf
Using requests:
r = requests.get('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf',timeout=60.0)
Using urllib:
urllib.urlretrieve('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf','the.pdf')
I've also tried different variants of the URL, and I can download the file in a browser as well as with cURL using the following syntax:
curl http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf
So I suspect it's an encoding issue, but I can't seem to get it to work. Any suggestions?
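To show what I mean by encoding: the percent-encoded form that cURL accepts can be produced like this (just an illustration; urllib.parse.quote is one way to do it):

from urllib.parse import quote

base = 'http://www.prociv.pt/cnos/HAI/Setembro/'
name = 'Incêndios Rurais - Histórico do Dia 29SET.pdf'

# quote() percent-encodes the spaces and the accented characters (as UTF-8),
# leaving '/' and '-' untouched
print(base + quote(name))
# -> http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf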
Upvotes: 1
Views: 244
Reputation: 1121952
It looks like the server responds differently depending on the client's User-Agent. If you specify a custom User-Agent header, the server responds with a PDF:
import requests
import shutil

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
headers = {'User-Agent': 'curl'}  # wink-wink

response = requests.get(url, headers=headers, stream=True)
if response.status_code == 200:
    with open('result.pdf', 'wb') as output:
        # decompress the body if the server used Content-Encoding (gzip/deflate)
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, output)
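Using stream=True together with shutil.copyfileobj writes the PDF to disk in chunks instead of loading the whole response into memory, which is handy for larger files.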
Demo:
>>> import requests
>>> url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
>>> headers = {'User-Agent': 'curl'} # wink-wink
>>> response = requests.get(url, headers=headers, stream=True)
>>> response.headers['content-type']
'application/pdf'
>>> response.headers['content-length']
'466191'
>>> response.raw.read(100)
'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(pt-PT) /StructTreeRoot 37 0 R/MarkInfo<</'
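Since you also tried urllib: the same User-Agent workaround should apply there as well. A minimal sketch with Python 3's urllib.request (same assumed URL as above; untested against this particular server):

import shutil
import urllib.request

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'

# send the same spoofed User-Agent header via a Request object
request = urllib.request.Request(url, headers={'User-Agent': 'curl'})
with urllib.request.urlopen(request, timeout=60) as response, open('result.pdf', 'wb') as output:
    shutil.copyfileobj(response, output)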
My guess is that someone once abused a Python script to download too many files from that server, and you are now being tar-pitted based on the User-Agent header alone.
Upvotes: 2