Martin

Reputation: 1640

Python Twitter tools and gzip error "IOError: CRC check failed"

I'm using Python Twitter tools to download the latest 200 tweets for each user in a large list. I'm getting a gzip error that occurs only intermittently: at seemingly random intervals, the loop crashes with the error stack below. If I immediately restart the loop and resend the same user, the download always succeeds. I've compared the response headers from failing requests with those from successful ones and can't see any difference. I've also confirmed that plenty of the responses I get back without a problem are gzipped too, and they decompress fine.

Has anyone seen this issue before and/or can suggest a fix/workaround?

Here is the error stack, for what it's worth:

File "/Users/martinlbarron/Dropbox/Learning Python/downloadTimeline.py", line 33, in <module>
    result=utility.downloadTimeline(kwargs,t)
  File "/Users/martinlbarron/Dropbox/Learning Python/utility.py", line 73, in downloadTimeline
    response=t.statuses.user_timeline(**kargs)
  File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 173, in __call__
    return self._handle_response(req, uri, arg_data)
  File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 184, in _handle_response
    data = f.read()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 245, in read
    self._read(readsize)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 299, in _read
    self._read_eof()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 338, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xf4196259 != 0x34967f68L
logout

Adding my code (be gentle, I'm a Python newbie)

I have a list of twitter names. I loop through them in the code below, calling my twitter download function (downloadTimeline).

t = Twitter(
    auth=OAuth("XXX", "XXX",
               "XXX", "XXX"))
for i in range(startRange,endRange):
    #Get the id string for user
    row=newlist[i]
    sc=row[3]
    kwargs = dict(count=200, include_rts=False, include_entities=False, trim_user=True, screen_name=sc)
    result=utility.downloadTimeline(kwargs,t)

In downloadTimeline, I get the Twitter response (response) and then parse each tweet into a dictionary:

def downloadTimeline(kargs, t):

    #Get timeline
    mylist = list()
    counter=1000
    try:
        response=t.statuses.user_timeline(**kargs)
        counter=response.rate_limit_remaining
        #parse the file out
        if len(response)>0:
            for tweet in response:
                user=tweet['user']
                record = {
                    'id_str': cleanLines(tweet['id_str']),
                    # omitting the full list of variables I save
                }
                mylist.append(record)

    except twitter.TwitterError as e:
        print("Fail: %i" % e.e.code)

    return  (mylist, counter)

Finally, though it's obviously not my code, this is the part of the Python Twitter tools framework that seems to be choking (the gzip branch: f = gzip.GzipFile(fileobj=buf) followed by data = f.read(), which is where the traceback points):

    def _handle_response(self, req, uri, arg_data):
        try:
            handle = urllib_request.urlopen(req)
            if handle.headers['Content-Type'] in ['image/jpeg', 'image/png']:
                return handle
            elif handle.info().get('Content-Encoding') == 'gzip':
                # Handle gzip decompression
                buf = StringIO(handle.read())
                f = gzip.GzipFile(fileobj=buf)
                data = f.read()
            else:
                data = handle.read()

            if "json" == self.format:
                res = json.loads(data.decode('utf8'))
                return wrap_response(res, handle.headers)
            else:
                return wrap_response(
                    data.decode('utf8'), handle.headers)
        except urllib_error.HTTPError as e:
            if (e.code == 304):
                return []
            else:
                raise TwitterHTTPError(e, uri, self.format, arg_data)

It turns out it's pretty easy to turn off the gzip Accept-Encoding header in Python Twitter tools. But when I do that, I get the following error instead, which makes me wonder whether the response is being truncated somehow:

  File "/Users/martinlbarron/Dropbox/Learning Python/downloadTimeline.py", line 33, in <module>
    result=utility.downloadTimeline(kwargs,t)
  File "/Users/martinlbarron/Dropbox/Learning Python/utility.py", line 73, in downloadTimeline
    response=t.statuses.user_timeline(**kargs)
  File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 175, in __call__
    return self._handle_response(req, uri, arg_data)
  File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 193, in _handle_response
    res = json.loads(handle.read().decode('utf8'))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 13699 (char 13699)
logout

Upvotes: 2

Views: 4436

Answers (1)

Fabio Cabral

Reputation: 658

Instead of:

buf = StringIO(handle.read())
f = gzip.GzipFile(fileobj=buf)
data = f.read()

Try this:

decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
data = decomp.decompress(handle.read())

Don't forget to import zlib.
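
To see what the replacement does outside the library, here is a small self-contained sketch. It builds a gzip payload in memory as a stand-in for handle.read() and decompresses it with the same zlib call. The gunzip_bytes helper and the sample payload are made up for illustration.

```python
import gzip
import io
import zlib


def gunzip_bytes(raw):
    """Decompress a gzip-wrapped byte string held in memory."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    # rather than a bare zlib stream.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decomp.decompress(raw) + decomp.flush()


# Build a small gzip payload to stand in for handle.read().
original = b'[{"id_str": "1"}]'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(original)
payload = buf.getvalue()

data = gunzip_bytes(payload)
```

One caveat: a decompressobj does not raise when the compressed stream ends early, so if the response really is being truncated this may hand back partial data instead of an error. On Python 3.3 and later, decomp.eof can be checked after decompressing to tell a complete stream from a truncated one.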

Upvotes: 2
