lm2s

Reputation: 388

Python requests or urllib read timeout, URL encoding issue?

I'm trying to download a file from within Python. I've tried urllib and requests, and both give me a timeout error. The file is at: http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf

Using requests:

r = requests.get('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf', timeout=60.0)

Using urllib:

urllib.urlretrieve('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf', 'the.pdf')

I've tried different variants of the URL encoding as well, with the same timeout.

I can, however, download the file using the browser, and also with cURL using the following syntax:

curl http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf

So I'm suspecting it's an encoding issue, but I can't seem to get it to work. Any suggestions?
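
For reference, the encoding can be normalised up front with the standard library; a minimal sketch (assuming Python 3; on Python 2 you would quote the UTF-8-encoded bytes with urllib.quote):

import urllib.parse

# The raw URL, with literal spaces and accented characters
raw = 'http://www.prociv.pt/cnos/HAI/Setembro/Incêndios Rurais - Histórico do Dia 29SET.pdf'

# quote() percent-escapes the UTF-8 bytes of non-ASCII characters; leave ':' and '/' intact
encoded = urllib.parse.quote(raw, safe=':/')
print(encoded)
# http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf

This yields the same percent-encoded form used in the cURL command above.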

EDIT: Clarity.

Upvotes: 1

Views: 244

Answers (1)

Martijn Pieters

Reputation: 1121952

It looks like the server responds differently depending on the client's User-Agent header. If you specify a custom User-Agent, the server responds with a PDF:

import requests
import shutil

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
headers = {'User-Agent': 'curl'}  # wink-wink
response = requests.get(url, headers=headers, stream=True)

if response.status_code == 200:
    with open('result.pdf', 'wb') as output:
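        # tell urllib3 to decode any gzip/deflate Content-Encoding while copying the raw stream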
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, output)

Demo:

>>> import requests
>>> url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
>>> headers = {'User-Agent': 'curl'}  # wink-wink
>>> response = requests.get(url, headers=headers, stream=True)
>>> response.headers['content-type']
'application/pdf'
>>> response.headers['content-length']
'466191'
>>> response.raw.read(100)
'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(pt-PT) /StructTreeRoot 37 0 R/MarkInfo<</'

My guess is that someone once used a Python script to download too many files from that server, and you are now being tar-pitted based on the User-Agent header alone.
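
For completeness, the same header trick should also work without requests; a minimal sketch using only the standard library (assuming Python 3's urllib.request, and reusing the URL and output filename from above):

import urllib.request

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'

# Send a non-default User-Agent; the stock 'Python-urllib/x.y' value is presumably what gets blocked
request = urllib.request.Request(url, headers={'User-Agent': 'curl'})

with urllib.request.urlopen(request, timeout=60) as response, open('result.pdf', 'wb') as output:
    output.write(response.read())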

Upvotes: 2
