Fernet

Reputation: 183

Google dataflow only partly uncompressing files compressed with pbzip2

First I create a test file, compress it with both bzip2 and pbzip2, and upload everything to the bucket:

seq 1 1000000 > testfile

bzip2 -kz9 testfile
mv testfile.bz2 testfile-bzip2.bz2

pbzip2 -kzb9 testfile
mv testfile.bz2 testfile-pbzip2.bz2

gsutil cp testfile gs://[bucket]
gsutil cp testfile-bzip2.bz2 gs://[bucket]
gsutil cp testfile-pbzip2.bz2 gs://[bucket]

Then I run the following pipeline on each of the two compressed files.

    // Read the BZIP2-compressed file and write it back out uncompressed, unsharded.
    p.apply(TextIO.read()
            .from(filePath)
            .withCompressionType(TextIO.CompressionType.BZIP2))
     .apply(TextIO.write()
            .to(filePath.substring(0, filePath.length() - 4))
            .withoutSharding());

This results in the following state of my bucket:

[Screenshot: files in the bucket after the job]

As you can see, the decompressed output of the pbzip2-compressed file is far too small to be complete. It seems only the first block was decompressed and the rest discarded.
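
To illustrate what I think is happening: pbzip2 compresses each block into its own bzip2 stream and simply concatenates the streams, so a reader that is not told to decompress concatenated streams stops after the first one. Here is a small local check, only a sketch: it assumes Apache Commons Compress (which I believe the SDK uses for BZIP2 input) and the test files created above.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class ConcatenatedStreamCheck {
      // Count the decompressed bytes, either stopping after the first bzip2
      // stream or reading all concatenated streams.
      static long countBytes(String path, boolean decompressConcatenated) throws IOException {
        try (InputStream in = new FileInputStream(path);
             BZip2CompressorInputStream bz =
                 new BZip2CompressorInputStream(in, decompressConcatenated)) {
          long total = 0;
          byte[] buf = new byte[8192];
          int n;
          while ((n = bz.read(buf)) != -1) {
            total += n;
          }
          return total;
        }
      }

      public static void main(String[] args) throws IOException {
        System.out.println(countBytes("testfile-pbzip2.bz2", false));
        System.out.println(countBytes("testfile-pbzip2.bz2", true));
        System.out.println(countBytes("testfile-bzip2.bz2", true));
      }
    }

If that suspicion is right, the first number should be much smaller than the other two, which should both match the size of the original testfile.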

pbzip2 version:

Parallel BZIP2 v1.1.12 [Dec 21, 2014]

bzip2 version:

bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.

I am using version 2.0.0 of the Dataflow SDK.

I have a lot of files compressed with pbzip2, and I would prefer not to change the way they are compressed.

Any suggestions on how to get around this? Is this even supposed to work with files compressed with pbzip2?

Upvotes: 0

Views: 112

Answers (1)

rf-

Reputation: 1493

This is a bug in how the BZIP2 library is invoked to read PBZIP2-generated files. The fix is in review as I type this. See BEAM-2708.
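
For context, the gist of the fix is to ask the decompressor to keep reading concatenated streams instead of stopping after the first one. The snippet below is only a sketch of that idea, not the actual patch, and it assumes the BZIP2 path goes through Apache Commons Compress:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    // Hypothetical helper, not Beam's actual code.
    class Bzip2Streams {
      static InputStream open(InputStream raw) throws IOException {
        // new BZip2CompressorInputStream(raw) stops at the end of the first
        // stream, which is why only the first pbzip2 block comes back. Passing
        // decompressConcatenated = true makes it read every stream in the file.
        return new BZip2CompressorInputStream(raw, /* decompressConcatenated= */ true);
      }
    }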

Upvotes: 2
