Reputation: 1640
I'm using Python Twitter Tools to download the latest 200 tweets for a large list of users. I'm getting a gzip error that occurs only intermittently: at seemingly random intervals, the loop crashes with the error stack below. If I immediately restart the loop and request that same user again, I never have a problem downloading it. I've looked at the headers of the responses that crash and there doesn't seem to be anything different from the headers of the ones that don't cause problems. I've also confirmed that plenty of the results that come back without a problem are gzipped too and decompress fine.
Has anyone seen this issue before and/or can suggest a fix/workaround?
Here is the error stack, for what it's worth:
File "/Users/martinlbarron/Dropbox/Learning Python/downloadTimeline.py", line 33, in <module>
result=utility.downloadTimeline(kwargs,t)
File "/Users/martinlbarron/Dropbox/Learning Python/utility.py", line 73, in downloadTimeline
response=t.statuses.user_timeline(**kargs)
File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 173, in __call__
return self._handle_response(req, uri, arg_data)
File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 184, in _handle_response
data = f.read()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 245, in read
self._read(readsize)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 299, in _read
self._read_eof()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 338, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0xf4196259 != 0x34967f68L
logout
Adding my code (be gentle, I'm a Python newbie)
I have a list of twitter names. I loop through them in the code below, calling my twitter download function (downloadTimeline).
t = Twitter(
    auth=OAuth("XXX", "XXX",
               "XXX", "XXX"))

for i in range(startRange, endRange):
    # Get the screen name for this user
    row = newlist[i]
    sc = row[3]
    kwargs = dict(count=200, include_rts=False, include_entities=False,
                  trim_user=True, screen_name=sc)
    result = utility.downloadTimeline(kwargs, t)
In downloadTimeline, I get the Twitter response (response) and then parse it into a list of dictionaries:
def downloadTimeline(kargs, t):
    # Get timeline
    mylist = list()
    counter = 1000
    try:
        response = t.statuses.user_timeline(**kargs)
        counter = response.rate_limit_remaining
        # Parse the response out into a list of dicts
        if len(response) > 0:
            for tweet in response:
                user = tweet['user']
                d = {
                    'id_str': cleanLines(tweet['id_str']),
                    # omitting the whole list of all the variables I save
                }
                mylist.append(d)
    except twitter.TwitterError as e:
        print("Fail: %i" % e.e.code)
    return (mylist, counter)
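Since an immediate retry always seems to succeed, my stopgap for now is to wrap the call in a simple retry loop. This is just a sketch (the wrapper name and retries parameter are mine), and it papers over the problem rather than fixing it:

import time

def downloadTimelineWithRetry(kargs, t, retries=3):
    # Retry wrapper around downloadTimeline: the gzip failure surfaces as an
    # IOError, so catch it, pause briefly, and request the same user again.
    for attempt in range(retries - 1):
        try:
            return downloadTimeline(kargs, t)
        except IOError as e:
            print("gzip IOError on attempt %d, retrying: %s" % (attempt + 1, e))
            time.sleep(1)
    # Last attempt: if it still fails, let the exception propagate.
    return downloadTimeline(kargs, t)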
Finally, though it's obviously not my code, this is the bit of the Python Twitter Tools framework that seems to be choking (specifically at the data = f.read() call on the GzipFile):
def _handle_response(self, req, uri, arg_data):
    try:
        handle = urllib_request.urlopen(req)
        if handle.headers['Content-Type'] in ['image/jpeg', 'image/png']:
            return handle
        elif handle.info().get('Content-Encoding') == 'gzip':
            # Handle gzip decompression
            buf = StringIO(handle.read())
            f = gzip.GzipFile(fileobj=buf)
            data = f.read()
        else:
            data = handle.read()
        if "json" == self.format:
            res = json.loads(data.decode('utf8'))
            return wrap_response(res, handle.headers)
        else:
            return wrap_response(
                data.decode('utf8'), handle.headers)
    except urllib_error.HTTPError as e:
        if (e.code == 304):
            return []
        else:
            raise TwitterHTTPError(e, uri, self.format, arg_data)
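One thing I'm considering, purely for debugging, is patching my local copy of that gzip branch to dump the raw compressed body to disk whenever the read fails, so I can poke at the bytes afterwards. The file path and the whole idea are mine, not part of the library:

buf = StringIO(handle.read())
f = gzip.GzipFile(fileobj=buf)
try:
    data = f.read()
except IOError:
    # Save the raw, still-compressed body for offline inspection
    # (e.g. with gunzip or a hex editor), then re-raise.
    with open('/tmp/bad_twitter_response.gz', 'wb') as dump:
        dump.write(buf.getvalue())
    raise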
It turns out it's pretty easy to turn the gzip Accept-Encoding header off in Python Twitter Tools. But when I do that, I get the following error instead. I'm wondering if the response is getting truncated somehow (see the diagnostic sketch after this traceback):
File "/Users/martinlbarron/Dropbox/Learning Python/downloadTimeline.py", line 33, in <module>
result=utility.downloadTimeline(kwargs,t)
File "/Users/martinlbarron/Dropbox/Learning Python/utility.py", line 73, in downloadTimeline
response=t.statuses.user_timeline(**kargs)
File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 175, in __call__
return self._handle_response(req, uri, arg_data)
File "/Library/Python/2.7/site-packages/twitter-1.9.0-py2.7.egg/twitter/api.py", line 193, in _handle_response
res = json.loads(handle.read().decode('utf8'))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 13699 (char 13699)
logout
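To test the truncation theory, I'm planning to drop a small check into my local copy right where the body is read, comparing the number of bytes received against the Content-Length header (when the server sends one; it can be absent for chunked responses). A rough sketch, all names my own:

def check_truncation(handle):
    # Diagnostic helper, not part of Python Twitter Tools: read the body and
    # compare its size against the declared Content-Length, if any.
    body = handle.read()
    declared = handle.info().get('Content-Length')
    if declared is None:
        print("No Content-Length header; read %d bytes" % len(body))
    elif int(declared) != len(body):
        print("Possible truncation: declared %s bytes, got %d" % (declared, len(body)))
    else:
        print("Lengths match: %d bytes" % len(body))
    return body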
Upvotes: 2
Views: 4436
Reputation: 658
Instead of:
buf = StringIO(handle.read())
f = gzip.GzipFile(fileobj=buf)
data = f.read()
Try this:
decomp = zlib.decompressobj(16+zlib.MAX_WBITS)
data = decomp.decompress(handle.read())
Don't forget to import zlib
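For what it's worth, here's the same idea as a self-contained snippet you can run to check that it round-trips gzip data (the 16 + zlib.MAX_WBITS argument tells zlib to expect, and skip, the gzip header and trailer rather than a raw zlib stream):

import gzip
import zlib
from StringIO import StringIO

# Build some gzip-compressed bytes, the same framing Twitter's gzipped responses use.
buf = StringIO()
gz = gzip.GzipFile(fileobj=buf, mode='wb')
gz.write('{"hello": "world"}')
gz.close()
compressed = buf.getvalue()

# 16 + MAX_WBITS means "gzip wrapper", not a bare zlib stream.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = decomp.decompress(compressed)
print(data)  # {"hello": "world"}

Note that if the response really were truncated, decompress() would, as far as I know, just return whatever bytes it could recover instead of raising a CRC error, so you'd then see the JSON decode error rather than the IOError.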
Upvotes: 2