Reputation: 3186
I have a snippet of code, shown below, that uses urllib2. I'm trying to convert it to pycurl to benefit from pycurl's proxy support. The converted pycurl code is shown after the original code. I want to know how to change the urllib2.urlopen(req).read() to something similar in pycurl, maybe using something like StringIO?
urllib2 code:
URL = 'URL'
UN = 'UN'
PWD = 'PWD'
HEADERS = { 'Accept': 'application/json',
            'Connection': 'Keep-Alive',
            'Accept-Encoding' : 'gzip',
            'Authorization' : 'Basic %s' % base64.encodestring('%s:%s' % (UN, PWD)) }
req = urllib2.Request(URL, headers=HEADERS)
response = urllib2.urlopen(req, timeout=(KEEP_ALIVE))
# header - print response.info()
decompressor = zlib.decompressobj(16+zlib.MAX_WBITS)
remainder = ''
while True:
    tmp = decompressor.decompress(response.read(CHUNKSIZE))
The pycurl conversion with proxy support:
URL = 'URL'
UN = 'UN'
PWD = 'PWD'
HEADERS = [ 'Accept : application/json',
            'Connection : Keep-Alive',
            'Accept-Encoding : gzip',
            'Authorization : Basic %s' % base64.encodestring('%s:%s' % (UN, PWD)) ]
req = pycurl.Curl()
req.setopt(pycurl.CONNECTTIMEOUT, KEEP_ALIVE)
req.setopt(pycurl.HTTPHEADER, HEADERS)
req.setopt(pycurl.TIMEOUT, 1+KEEP_ALIVE)
req.setopt(pycurl.PROXY, 'http://my-proxy')
req.setopt(pycurl.PROXYPORT, 8080)
req.setopt(pycurl.PROXYUSERPWD, "proxy_access_user : proxy_access_password")
req.setopt(pycurl.URL, URL)
response = req.perform()
decompressor = zlib.decompressobj(16+zlib.MAX_WBITS)
remainder = ''
while True:
    tmp = decompressor.decompress(urllib2.urlopen(req).read(CHUNKSIZE))
Thanks in advance.
Upvotes: 1
Views: 971
Reputation: 365787
Unlike urllib2, which returns an object that you can use to get the data, curl needs you to pass it an object that it can use to store the data.
The simple way to do this, used in most of the examples, is to pass a file object as the WRITEDATA option. You might think you could just pass a StringIO here, like this:
# ...
s = StringIO.StringIO()
req.setopt(pycurl.WRITEDATA, s)
req.perform()
data = s.getvalue()
Unfortunately, that won't work: the file object has to be a real file (or at least something with a C-level file descriptor), and a StringIO doesn't qualify.
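You can see the difference for yourself without touching curl at all. A minimal standalone sketch (names are mine, not from the answer; I use io.BytesIO, which behaves like the question's StringIO here): a real temporary file exposes an OS-level descriptor via fileno(), while an in-memory buffer raises instead.

```python
import io
import tempfile

# A real file has an OS-level file descriptor, which is what a
# WRITEDATA file object must provide.
f = tempfile.NamedTemporaryFile()
real_fd = f.fileno()  # an integer descriptor from the OS
f.close()

# An in-memory buffer has no descriptor, so curl can't write to it
# directly. (The question's StringIO behaves the same way.)
buf = io.BytesIO()
try:
    buf.fileno()
    has_fd = True
except io.UnsupportedOperation:
    has_fd = False
```

Here has_fd ends up False, which is exactly why passing an in-memory buffer as WRITEDATA fails.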
You could of course use a NamedTemporaryFile, but if you'd prefer to keep the data in memory (or, better, not store it in memory or on disk at all, but just process it on the fly), that won't help.
The solution is to use the WRITEFUNCTION option instead:
s = StringIO.StringIO()
req.setopt(pycurl.WRITEFUNCTION, s.write)
req.perform()
data = s.getvalue()
As you can see, you can use a StringIO for this if you want; in fact, that's exactly what the curl object documentation from pycurl does. But it's not really simplifying things much over any other way of accumulating strings (like putting them in a list and ''.join-ing them, or even just concatenating them onto a string).
Note that I linked to the C-level libcurl docs, not the pycurl docs, because pycurl's documentation basically just says "FOO does the same thing as CURLOPT_FOO" (even when there are differences, like the fact that your WRITEFUNCTION doesn't get the size, nmemb, and userdata parameters).
What if you want to stream the data on the fly? Just use a WRITEFUNCTION that accumulates and processes it on the fly. You won't be writing a loop yourself, but curl will be looping internally and driving the process. For example:
z = zlib.decompressobj(16+zlib.MAX_WBITS)  # gzip wrapper, as in your original code
s = []
def handle(chunk):
    s.append(z.decompress(chunk))
    return len(chunk)
req.setopt(pycurl.WRITEFUNCTION, handle)
req.perform()
s.append(z.flush())
data = ''.join(s)
curl will call your function once for each chunk of data it retrieves, so the entire loop happens inside that req.perform() call. (It may also call it again with 0 bytes at the end, so make sure your callback function can handle that. I think z.decompress can, but you might want to verify that.)
There are ways to limit the size of each write, to abort the download in the middle, to get the header as part of the write instead of separately, etc., but usually you won't need to touch those.
Upvotes: 2