Reputation: 2845
I am running Python 3.6.4 on Windows 10 with the Fall Creators Update. I am attempting to decompress a Wikimedia data dump file, specifically https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-meta-current.xml.bz2.
This file decompresses without problems using 7z on the command line, but with the Python decompressor the first block of data produces zero-length output. The code follows:
import bz2

def decompression(qin,   # Iterable supplying input bytes data
                  qout): # Pipe to next process - needs bytes data
    decomp = bz2.BZ2Decompressor()  # Create a decompressor
    for chunk in qin:  # Loop obtaining data from source iterable
        lc = len(chunk)  # = 16384
        dc = decomp.decompress(chunk)  # Do the decompression
        ldc = len(dc)  # = 0
        qout.put(dc)  # Pass the decompressed chunk to the next process
I have verified that the bz2 header is valid, and since the file decompresses without problems using command-line utilities, the problem seems to be related to the Python implementation of bz2. The following values from the decompressor look fine and match what the documentation describes:
eof = False
unused_data = b''
needs_input = True
Any suggestions on how to troubleshoot this problem?
Upvotes: 4
Views: 1773
Reputation: 112269
Beats me. I can't find anything wrong with your function. The following works on the linked .bz2 file with no issue, where the output exactly matches the result of a command-line decompression of that .bz2 file:
import sys
import bz2

def decompression(qin,   # Iterable supplying input bytes data
                  qout): # Pipe to next process - needs bytes data
    decomp = bz2.BZ2Decompressor()  # Create a decompressor
    for chunk in qin:  # Loop obtaining data from source iterable
        lc = len(chunk)  # = 16384
        dc = decomp.decompress(chunk)  # Do the decompression
        # qout.put(dc)  # Pass the decompressed chunk to the next process
        qout.write(dc)

with open('enwiktionary-latest-pages-meta-current.xml.bz2', 'rb') as f:
    it = iter(lambda: f.read(16384), b'')
    decompression(it, sys.stdout.buffer)
I only made one trivial change to your function in order to write the result to stdout. I am using Python 3.6.4. I also tried it with Python 2.7.10 (removing the .buffer), and it again worked flawlessly.
Are you actually just letting your function run? What do you mean by "fails on the first block"? The first few calls (seven in this case) will in fact return no decompressed data, because you have not yet provided a complete block for it to work on. But there are no errors reported.
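To see the empty leading results for yourself without downloading the dump, here is a minimal sketch. The helper name output_lengths and the random test data are my own, chosen only for illustration; it compresses one block's worth of data in memory and records how many bytes each decompress() call returns.

```python
import bz2
import os

def output_lengths(compressed, chunk_size=16384):
    """Feed compressed bytes to a BZ2Decompressor in fixed-size chunks.

    Returns the number of bytes produced by each decompress() call,
    together with the reassembled output.
    """
    decomp = bz2.BZ2Decompressor()
    lengths = []
    pieces = []
    for start in range(0, len(compressed), chunk_size):
        dc = decomp.decompress(compressed[start:start + chunk_size])
        lengths.append(len(dc))
        pieces.append(dc)
    return lengths, b''.join(pieces)

# Random bytes barely compress, so the single bzip2 block spans many chunks.
original = os.urandom(200000)
compressed = bz2.compress(original)
lengths, restored = output_lengths(compressed)

print(lengths)               # leading entries are 0 until the block is complete
print(restored == original)  # the data still round-trips
```

A zero-length result from an early call is therefore not a failure by itself; the thing to check is whether output appears once enough input has been supplied.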
Note: to do this right for .bz2 files that contain concatenated bzip2 streams, you would need to check eof after each call. Whenever it becomes true, create a new decompressor object and feed it the unused_data from the previous decompressor object, followed by more data read from the compressed file. The linked file isn't one of those.
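A rough sketch of that multi-stream loop follows. The generator name decompress_multistream is hypothetical, and the sketch assumes the input contains only complete bzip2 streams with no trailing garbage.

```python
import bz2

def decompress_multistream(chunks):
    """Yield decompressed data from an iterable of compressed byte chunks,
    starting a fresh decompressor whenever one bzip2 stream ends."""
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        data = chunk
        while data:
            if decomp.eof:
                # Previous stream is finished; a used-up decompressor
                # cannot be fed again, so start a new one.
                decomp = bz2.BZ2Decompressor()
            out = decomp.decompress(data)
            if out:
                yield out
            # Bytes past the end of a stream belong to the next stream.
            data = decomp.unused_data if decomp.eof else b''

# Two independently compressed streams, concatenated and split into tiny chunks.
payload = bz2.compress(b'first ') + bz2.compress(b'second')
chunks = [payload[i:i + 7] for i in range(0, len(payload), 7)]
result = b''.join(decompress_multistream(chunks))
print(result)
```

If you do not need incremental control over the input, bz2.open() and bz2.BZ2File in Python 3.3 and later read multi-stream files directly.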
Upvotes: 2