NAME__

Reputation: 633

Python Adding Headers to urlparse

There doesn't appear to be a way to add headers when using urlparse. As a result, Python sends its default user agent, which several websites block. What I am trying to do is the equivalent of this:

req = Request(INPUT_URL,headers={'User-Agent':'Browser Agent'})

But using urlparse:

parsed = list(urlparse(INPUT_URL))

So how can I make urlparse take headers, or use its result with the Request I created? Any help is appreciated, thanks.

Also, for anyone wondering the exact error I am getting:

urllib.error.HTTPError: HTTP Error 403: Forbidden

At this:

urlretrieve(urlunparse(parsed),outpath)

Upvotes: 0

Views: 994

Answers (1)

Martijn Pieters

Reputation: 1121564

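Headers are part of a request, of which the URL is only one part. urlparse() merely splits a URL string into its components and never sends anything, so there is nothing for it to attach headers to. Python creates a request for you when you pass just a URL to the urllib.request functions.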

Create a Request object, add the headers to that object and use that instead of a string URL:

request = Request(urlunparse(parsed), headers={'User-Agent': 'My own agent string'})
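
As a minimal sketch of how the pieces fit together (the helper name build_request and the example URL are made up for illustration, not part of the question), the parsed URL can be reassembled and wrapped in a Request like this:

```python
from urllib.parse import urlparse, urlunparse
from urllib.request import Request


def build_request(input_url, user_agent="Browser Agent"):
    """Return a Request for input_url that carries a custom User-Agent."""
    # urlparse() only splits the string; turning the result into a list
    # makes the six components editable before they are joined again.
    parsed = list(urlparse(input_url))
    # ... modify parsed[0] through parsed[5] here if needed ...
    return Request(urlunparse(parsed), headers={"User-Agent": user_agent})


req = build_request("http://example.com/some/file.txt?x=1")
print(req.full_url)
# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

No network traffic happens here; the request is only sent once the object is passed to something like urlopen().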

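However, urlretrieve() is marked as 'legacy API' in the code and doesn't accept a Request object. It is easy enough to make your own copy that does: with the few lines supporting 'file://' URLs removed, the function passes its first argument straight to urlopen(), which takes either a URL string or a Request: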

import contextlib
import tempfile
from urllib.error import ContentTooShortError
from urllib.request import urlopen

# Names of temporary files created when no filename is given.
_url_tempfiles = []

def urlretrieve(url, filename=None, reporthook=None, data=None):
    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
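
If the progress hook and temporary-file handling are not needed, a shorter alternative may be enough. This is only a sketch, not the function above: the name download is hypothetical, and it swaps in shutil.copyfileobj to stream the response to disk without checking Content-Length.

```python
import shutil
from urllib.request import Request, urlopen


def download(request, outpath):
    """Stream the response for request (a URL string or a Request) to outpath."""
    # urlopen() accepts a Request object, so custom headers are sent as-is.
    with urlopen(request) as response, open(outpath, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return outpath


# Hypothetical usage, with `parsed` and `outpath` as in the question:
# request = Request(urlunparse(parsed), headers={"User-Agent": "My own agent string"})
# download(request, outpath)
```

Unlike the copy of urlretrieve() above, this version does not raise ContentTooShortError on a truncated download, so it trades that safety check for brevity.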

Upvotes: 1
