Reputation: 3049
I am crawling the web using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.
I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.
I want to limit the file size to 25 MB. Is there a way I can do this with urllib3?
Upvotes: 3
Views: 4737
Reputation: 18157
If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.
To do this, you'll need to make sure that you're not preloading the full response.
from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
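For the 25 MB cap mentioned in the question, the same idea can be wrapped into a small helper. This is only a sketch of one way to do it; the function name fetch_limited, the example URL, and the return-None-when-too-large convention are my own choices, not part of urllib3.

import urllib3

MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit from the question

def fetch_limited(pool, url, max_bytes=MAX_BYTES):
    """Return the body as bytes if it fits under max_bytes, otherwise None."""
    response = pool.request("GET", url, preload_content=False)
    try:
        content_length = response.headers.get("Content-Length")
        if content_length is not None and int(content_length) > max_bytes:
            # Declared size is already too large; skip the body entirely
            return None

        chunks = []
        amount_read = 0
        for chunk in response.stream():
            amount_read += len(chunk)
            if amount_read > max_bytes:
                # Body turned out larger than allowed; abandon it
                return None
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        # Release the connection back into the pool either way
        response.release_conn()

pool = urllib3.PoolManager()
data = fetch_limited(pool, "https://example.com/some-file")  # illustrative URL
if data is not None:
    print(f"Downloaded {len(data)} bytes")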
Upvotes: 7