Reputation: 633
There doesn't appear to be a way to add headers when using the urlparse function. As a result, Python sends its default user agent, which several websites block. What I am trying to do is the equivalent of this:
req = Request(INPUT_URL,headers={'User-Agent':'Browser Agent'})
But using urlparse:
parsed = list(urlparse(INPUT_URL))
So how can I modify this urlparse in order for it to take headers, or be usable with my Request that I created? Any help is appreciated, thanks.
Also, for anyone wondering the exact error I am getting:
urllib.error.HTTPError: HTTP Error 403: Forbidden
At this:
urlretrieve(urlunparse(parsed),outpath)
Upvotes: 0
Views: 994
Reputation: 1121564
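Headers are part of a request, of which the URL is one part. urlparse() only splits a URL string into its components and never makes a request itself, so there is nothing for it to attach headers to. Python creates a request for you when you pass just a URL to the urllib.request functions.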
Create a Request object, add the headers to that object, and use that instead of a string URL:
request = Request(urlunparse(parsed), headers={'User-Agent': 'My own agent string'})
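As a minimal sketch of that round trip, the helper below parses a URL, reassembles it, and wraps it in a Request carrying a custom User-Agent. The name build_request and the example URL are hypothetical, chosen only for illustration; nothing is sent over the network until the Request is passed to urlopen().

```python
from urllib.parse import urlparse, urlunparse
from urllib.request import Request


def build_request(input_url, user_agent="My own agent string"):
    """Hypothetical helper: parse, reassemble, and wrap a URL in a Request."""
    # urlparse() returns a tuple-like result; a list makes it editable.
    parsed = list(urlparse(input_url))
    # Modify any components of `parsed` here before reassembling.
    return Request(urlunparse(parsed), headers={"User-Agent": user_agent})


# Example URL is an assumption; no network access happens at this point.
request = build_request("https://example.com/path/file.txt?x=1")
```

Request normalizes header names with str.capitalize(), so the stored key is 'User-agent'; request.get_header('User-agent') returns the agent string.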
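However, urlretrieve() is marked as 'legacy API' in the code and doesn't support a Request object, because it inspects the URL string to special-case 'file://' URLs. Copying the function without those few lines is easy enough; this version passes its first argument straight to urlopen(), so it accepts a Request as well as a URL string: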
import contextlib
import tempfile
from urllib.error import ContentTooShortError
from urllib.request import urlopen

_url_tempfiles = []

def urlretrieve(url, filename=None, reporthook=None, data=None):
    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
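If the progress hook and temporary-file handling aren't needed, a shorter alternative is to skip urlretrieve() entirely and stream the response to disk. This is only a sketch under that assumption, not the approach above; the function name download and its default agent string are hypothetical.

```python
import shutil
from urllib.request import Request, urlopen


def download(url, outpath, user_agent="My own agent string"):
    """Hypothetical helper: save the body of `url` to `outpath`."""
    # The Request carries the custom header; urlopen() accepts it directly.
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response, open(outpath, "wb") as out_file:
        # Copy the response in chunks rather than reading it all into memory.
        shutil.copyfileobj(response, out_file)
    return outpath
```

With the names from the question, the call would be download(urlunparse(parsed), outpath). Unlike urlretrieve(), this sketch does not compare the bytes written against Content-Length.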
Upvotes: 1