Reputation: 829
Ok, so I have a zip file that contains gz files (unix gzip).
Here's what I do --
def parseSTS(file):
import zipfile, re, io, gzip
with zipfile.ZipFile(file, 'r') as zfile:
for name in zfile.namelist():
if re.search(r'\.gz$', name) != None:
zfiledata = zfile.open(name)
print("start for file ", name)
with gzip.open(zfiledata,'r') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
This gives the following result --
>>>
start for file XXXXXX.gz
done opening
done reading
Then stays like that forever until it crashes ...
What can I do with filecontent
?
Edit : this is not a duplicate since my gzipped files are in a zipped file and i'm trying to avoid extracting that zip file to disk. It works with zip files in a zip file as per How to read from a zip file within zip file in Python? .
Upvotes: 2
Views: 3139
Reputation: 1225
I created a zip file containing a gzip'ed PDF file I grabbed from the web.
I ran this code (with two small changes):
1) Fixed indenting of everything under the def statement (which I also corrected in your Question because I'm sure that it's right on your end or it wouldn't get to the problem you have).
2) I changed:
zfiledata = zfile.open(name)
print("start for file ", name)
with gzip.open(zfiledata,'r') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
to:
print("start for file ", name)
with gzip.open(name,'rb') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
Because you were passing a file object to gzip.open instead of a string. I have no idea how your code is executing without that change, but it was crashing for me until I fixed it.
EDIT: Adding link to GZIP docs from James R's answer --
Also, see here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
END EDIT
Now, since my gzip'ed file is small, the behavior I observe is that is pauses for about 3 seconds after printing done reading
, then outputs what is in filecontent
.
I would suggest adding the following debugging line after your print "done reading" -- print len(filecontent)
. If this number is very, very large, consider not printing the entire file contents in one shot.
I would also suggest reading this for more insight into what I expect is your problem: Why is printing to stdout so slow? Can it be sped up?
EDIT 2 - an alternative if your system does not handle file io on zip files, causing no such file errors in the above:
def parseSTS(afile):
import zipfile
import zlib
import gzip
import io
with zipfile.ZipFile(afile, 'r') as archive:
for name in archive.namelist():
if name.endswith('.gz'):
bfn = archive.read(name)
bfi = io.BytesIO(bfn)
g = gzip.GzipFile(fileobj=bfi,mode='rb')
qqq = g.read()
print qqq
parseSTS('t.zip')
Upvotes: 1
Reputation: 4656
Most likely your problem lies here:
if name.endswith(".gz"): #as goncalopp said in the comments, use endswith
#zfiledata = zfile.open(name) #don't do this
#print("start for file ", name)
with gzip.open(name,'rb') as gzfile: #gz compressed files should be read in binary and gzip opens the files directly
#print("done opening") #trust in your program, luke
filecontent = gzfile.read()
#print("done reading")
print(filecontent)
See here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
Upvotes: 0