Reputation: 183
First I create a test file, compress it with both bzip2 and pbzip2 (keeping the original), and upload all three files to a GCS bucket:

# create a test file with one number per line
seq 1 1000000 > testfile
# compress with standard bzip2 at maximum block size
bzip2 -kz9 testfile
mv testfile.bz2 testfile-bzip2.bz2
# compress with parallel pbzip2 at maximum block size
pbzip2 -kzb9 testfile
mv testfile.bz2 testfile-pbzip2.bz2
# upload the original and both compressed versions
gsutil cp testfile gs://[bucket]
gsutil cp testfile-bzip2.bz2 gs://[bucket]
gsutil cp testfile-pbzip2.bz2 gs://[bucket]
Then I run the following pipeline on each of the two compressed files:
p.apply(TextIO.read()
        .from(filePath)
        .withCompressionType(TextIO.CompressionType.BZIP2))
 .apply(TextIO.write()
        .to(filePath.substring(0, filePath.length() - 4)) // strip the ".bz2" suffix
        .withoutSharding());
This results in the following state of my bucket: the output decompressed from the bzip2 file looks complete, but the output decompressed from the pbzip2 file is far too small to be the complete original. It seems only the first block was decompressed and the rest discarded.
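My working theory (an assumption on my part, not something I have confirmed in the SDK source): pbzip2 writes its output as several independent bzip2 streams concatenated together, one per block, whereas plain bzip2 writes a single stream, so a reader that stops at the first end-of-stream marker recovers only the first block. The truncation can be reproduced locally with Apache Commons Compress, whose BZip2CompressorInputStream constructor takes a decompressConcatenated flag:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class ConcatCheck {
  // Count how many bytes a .bz2 file decompresses to,
  // with and without support for concatenated streams.
  static long decompressedBytes(String path, boolean concatenated) throws IOException {
    try (InputStream raw = new FileInputStream(path);
         InputStream bz = new BZip2CompressorInputStream(raw, concatenated)) {
      byte[] buf = new byte[8192];
      long total = 0;
      for (int n; (n = bz.read(buf)) != -1; ) {
        total += n;
      }
      return total;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(decompressedBytes("testfile-pbzip2.bz2", false)); // first stream only
    System.out.println(decompressedBytes("testfile-pbzip2.bz2", true));  // whole file
  }
}

On the test file above, the first call should stop after roughly the first 900 kB of output (one block), while the second should return the full decompressed size.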
pbzip2 version:
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
bzip2 version:
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
I am using version 2.0.0 of the Dataflow SDK.
I have a lot of files compressed with pbzip2, and I would prefer not to change the way they are compressed.
Any suggestions on how to get around this? Is this even supposed to work with files compressed with pbzip2?
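For reference, the fallback I would like to avoid is re-compressing every file into a single bzip2 stream before upload, along these lines (a rough local sketch, again using Apache Commons Compress; the file names are the test files from above):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class Recompress {
  public static void main(String[] args) throws IOException {
    // Read the pbzip2 file with concatenated-stream support enabled,
    // then rewrite the content as one single bzip2 stream.
    try (BZip2CompressorInputStream in = new BZip2CompressorInputStream(
             new BufferedInputStream(new FileInputStream("testfile-pbzip2.bz2")), true);
         BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(
             new BufferedOutputStream(new FileOutputStream("testfile-recompressed.bz2")))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
  }
}

This works for the small test file, but re-processing the whole dataset is exactly what I would prefer to avoid.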
Upvotes: 0
Views: 112