spaceface

Reputation: 85

Removing null bytes from a file results in larger output after XZ compression

I am developing a custom file format consisting of a large number of integers. I'm using XZ to compress the data as it has the best compression ratio of all compression algorithms I've tested.

All integers are stored as u32s in RAM, but none of them needs more than 24 bits, so I updated the serializer to skip the high byte (which is always 0 anyway) in an attempt to compress the data further. However, this had the opposite effect: the compressed file is now larger than before.
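For illustration, the serializer change can be sketched roughly as below. This is a minimal sketch, not the actual serializer: the function names `pack_u32` and `pack_u24` are hypothetical, and little-endian byte order is assumed because that is what the hexdump output in this question suggests.

```python
import struct


def pack_u32(values):
    """Serialize each integer as 4 little-endian bytes (the original layout)."""
    return b"".join(struct.pack("<I", v) for v in values)


def pack_u24(values):
    """Serialize each integer as 3 little-endian bytes, dropping the high byte."""
    out = bytearray()
    for v in values:
        # The high byte can only be dropped if the value fits in 24 bits.
        if not 0 <= v < (1 << 24):
            raise ValueError(f"{v} does not fit in 24 bits")
        out += v.to_bytes(3, "little")
    return bytes(out)


# The first four integers of chunk 1.
values = [0x000002, 0x000003, 0x072D55, 0x0C9179]

print(pack_u24(values).hex(" ", 3))  # 020000 030000 552d07 79910c
print(pack_u32(values).hex(" ", 4))  # 02000000 03000000 552d0700 79910c00
```

The two printed lines match the start of the two hexdumps, so the only difference between the files should be the missing zero byte after every third byte.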

$ xzcat 24bit.xz | hexdump -e '3/1 "%02x" " "' 
020000 030000 552d07 79910c [...] b92c23 c82c23 

$ xzcat 32bit.xz | hexdump -e '4/1 "%02x" " "' 
02000000 03000000 552d0700 79910c00 [...] b92c2300 c82c2300

$ xz -l 24bit.xz 32bit.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1     82.4 MiB    174.7 MiB  0.472  CRC64   24bit.xz
    1       1     77.2 MiB    233.0 MiB  0.331  CRC64   32bit.xz
-------------------------------------------------------------------------------
    2       2    159.5 MiB    407.7 MiB  0.391  CRC64   2 files

Now, I wouldn't have an issue if the size of the compressed file had stayed the same, as a perfect compressor would detect that all of those bytes are redundant anyway and compress them down to practically nothing. However, I do not understand how removing data from the source file can possibly result in a larger file after compression.

I've tried changing the LZMA compressor settings, and xz --lzma2=preset=9e,lc=4,pb=0 yielded a slightly smaller file at 82.2 MiB, but this is still significantly larger than the 77.2 MiB compressed 32-bit file.
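For anyone who wants to reproduce this without the command line, the same filter settings can be approximated with Python's standard `lzma` module. This is only a sketch: `compress_xz` is a hypothetical helper name, and the output is not guaranteed to be byte-identical to what the `xz` tool produces.

```python
import lzma


def compress_xz(data, preset=9 | lzma.PRESET_EXTREME, lc=4, pb=0):
    """Compress bytes into an .xz container with a custom LZMA2 filter.

    The defaults mirror --lzma2=preset=9e,lc=4,pb=0 from the question.
    """
    filters = [
        {"id": lzma.FILTER_LZMA2, "preset": preset, "lc": lc, "pb": pb},
    ]
    return lzma.compress(
        data,
        format=lzma.FORMAT_XZ,
        check=lzma.CHECK_CRC64,
        filters=filters,
    )
```

Calling `len(compress_xz(raw))` on the 24-bit and 32-bit serializations gives the two sizes to compare. Preset 9e needs several hundred MiB of encoder memory, so a lower `preset` is more practical for quick experiments.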


The order of the integers is somewhat important, so naively sorting the entire file won't work. The file is made up of different chunks, and the numbers within each chunk are currently sorted for slightly better compression. The order of the numbers inside a chunk does not matter; only the order of the chunks themselves does.

Chunk 1:    000002 000003 072d55 0c9179 148884 1e414b
Chunk 2:    00489f 0050c5 0080a6 0082f0 0086f6 0086f7 01be81 03bdb1 03be85 03bf4e 04dfe6 04dfea 0583b1 061125 062006 067499 07d7e6 08074d 0858b8 09d35d 09de04 0cfd78 0d06be 0d3869 0d5534 0ec366 0f529c 0f6d0d 0fecce 107a7e 107ab3 13bc0b 13e160 15a4f9 15ab39 1771e3 17fe9c 18137d 197a30 1a087a 1a2007 1ab3b9 1b7d3c 1ba52c 1bc031 1bcb6b 1de7d2 1f0866 1f17b6 1f300e 1f39e1 1ff426 206c51 20abbe 20cbbc 211a58 211a59 215f73 224ea8 227e3f 227eab 22f3b7 231aef 004b15 004c86 0484e7 06216e 08074d 0858b8 0962ed 0eb020 0ec366 1a62c2 1fefae 224ea8 0a2701 1e414b 
Chunk 3:    000006 003b17 004b15 004b38 [...]
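The ordering constraint described above can be sketched as follows. This is a minimal illustration with a hypothetical function name, and it says nothing about how chunk boundaries are stored in the file, since the question does not specify that.

```python
def sort_within_chunks(chunks):
    """Sort the integers inside each chunk while keeping the chunks in order."""
    return [sorted(chunk) for chunk in chunks]


# Hypothetical input: two chunks whose members arrive unsorted.
chunks = [
    [0x1E414B, 0x000002, 0x072D55, 0x000003],
    [0x0050C5, 0x00489F],
]

for chunk in sort_within_chunks(chunks):
    print(" ".join(f"{v:06x}" for v in chunk))
# 000002 000003 072d55 1e414b
# 00489f 0050c5
```

Any reordering that stays inside a chunk is allowed by the format; moving values across chunk boundaries or reordering the chunks is not.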

Upvotes: 3

Views: 280

Answers (0)
