mh8020

Reputation: 1844

Compression Formats and Delimiter Sequences

My question is: Are there any standard compression formats which can ensure that a certain delimiter sequence does not occur in the compressed data stream?

We want to design a binary file format, containing chunks of sequential data (3D coordinates + other data, not really important for the question). Each chunk should be compressed using a standard compression format, like GZIP, ZIP, ...

So, the file structure will be like:

FileHeader
ChunkDelimiter Chunk1_Header compress(Chunk1_Data)
ChunkDelimiter Chunk2_Header compress(Chunk2_Data)
...

The use case is the following: the files should be read in splits in Hadoop, so we want to be able to start at an arbitrary byte position in the file and find the start of the next chunk by scanning for the delimiter sequence. Hence the delimiter sequence must not occur within the compressed chunks.

I know that we could post-process the compressed data, "escaping" the delimiter sequence in case it occurs in the compressed output. But we would rather avoid this, since the reverse escaping would be required in the decoder, adding complexity.
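For reference, such an escaping scheme can be quite small. A minimal sketch in Python, assuming a hypothetical two-byte delimiter 0xFF 0x00: every 0xFF in the compressed payload is expanded to 0xFF 0x01, so in the escaped stream every 0xFF is followed by 0x01 and the delimiter can never occur.

```python
# Hypothetical delimiter and escaping scheme, for illustration only.
DELIM = b"\xff\x00"  # chunk delimiter we want to keep out of the payload

def escape(payload: bytes) -> bytes:
    # Expand every 0xFF to 0xFF 0x01; afterwards every 0xFF in the
    # stream is followed by 0x01, so DELIM (0xFF 0x00) cannot occur.
    return payload.replace(b"\xff", b"\xff\x01")

def unescape(escaped: bytes) -> bytes:
    # Collapse each 0xFF 0x01 pair back to a bare 0xFF.
    return escaped.replace(b"\xff\x01", b"\xff")
```

The cost is one extra byte per 0xFF in the compressed data (roughly 0.4% overhead on random-looking compressed output), plus exactly the decoder complexity we would rather avoid.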

Some more facts why we chose this file format:

Upvotes: 2

Views: 472

Answers (2)

Mark Adler

Reputation: 112284

No. I know of no standard compression format that does not allow any sequence of bits to occur somewhere within. To do otherwise would (slightly) degrade compression, going against the original purpose of a compression format.

The solutions are: a) post-process the sequence with a specified break pattern, inserting escapes wherever the break pattern accidentally appears in the compressed data (this is guaranteed to work, but you don't like this solution), or b) trust that the universe is not conspiring against you and use a break pattern long enough that it is incredibly unlikely to appear accidentally in any of the sequences this is applied to, anytime from now until the heat death of the universe.

For b) you can protect somewhat against the universe conspiring against you by selecting a random pattern for each file, and providing the random pattern at the start of the file. For the truly paranoid, you could go even further and generate a new random pattern for each successive break, from the previous pattern.
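A sketch of that idea (the format, magic bytes, and field layout here are hypothetical, with zlib standing in for the compressor): the writer draws a fresh random 16-byte break pattern per file and records it in the file header so readers can learn it.

```python
import os
import struct
import zlib

MAGIC = b"MYFM"  # hypothetical file magic

def write_chunked(path, chunks):
    """Write a file whose header carries a per-file random break pattern;
    each chunk is stored as: pattern, payload length, compressed payload."""
    pattern = os.urandom(16)  # fresh random break pattern for this file
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(pattern)  # readers learn the pattern from the header
        for chunk in chunks:
            payload = zlib.compress(chunk)
            f.write(pattern)
            f.write(struct.pack(">I", len(payload)))
            f.write(payload)
    return pattern
```

A split reader then seeks to an arbitrary byte offset, scans forward for the pattern, and resumes at the next chunk boundary.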

Note that the universe can conspire against you for a fixed pattern. If you make one of these compressed files with a fixed break pattern, and then you include that file in another compressed archive also using that break pattern, that archive will likely not be able to compress this already compressed file and will simply store it, leaving exposed the same fixed break pattern as is being used by the archive.

Another protection for b) would be to detect a false break by noticing that decompression of the piece before the break does not terminate cleanly, and to handle that case by rejoining that piece with the following piece and retrying the decompression. You would very likely detect the problem on the following piece too, since its decompression would fail.
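A sketch of that recovery logic, assuming the pieces between break patterns are zlib streams: if a piece's stream does not terminate cleanly, treat the break as false, re-insert the pattern bytes the split removed, append the next piece, and retry.

```python
import zlib

def decompress_pieces(pieces, pattern):
    """Decompress pieces split on `pattern`, rejoining pieces whenever a
    break turns out to be a false positive inside compressed data."""
    out, i = [], 0
    while i < len(pieces):
        buf = pieces[i]
        while True:
            d = zlib.decompressobj()
            try:
                data = d.decompress(buf)
                if not d.eof:  # stream did not terminate: false break
                    raise zlib.error("truncated stream")
                out.append(data)
                break
            except zlib.error:
                # Rejoin with the next piece, restoring the pattern bytes
                # the split removed, and try decompressing again.
                i += 1
                buf += pattern + pieces[i]
        i += 1
    return out
```

Restarting decompression from scratch on each retry is wasteful but simple; since a false break is astronomically rare, the retry path almost never runs.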

Upvotes: 1

Clément MATHIEU

Reputation: 3161

I won't answer your question with the name of a compression scheme, but I will give you a hint of how others have solved the same issue.

Let's take a look at Avro. Basically, it has similar requirements: files must be splittable, and each data block can be compressed (you can even choose your compression scheme).

From the Avro Specification we learn that splittability is achieved with the help of a synchronization marker ("Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing."). We also discover that the synchronization marker is a 16-byte randomly-generated value ("The 16-byte, randomly-generated sync marker for this file.").

How does it solve your issue? Well, since Martin Kleppmann provided a great answer to this question a few years ago, I will just copy-paste his message:

On 23 January 2013 21:09, Josh Spiegel wrote:

As I understand it, Avro container files contain synchronization markers every so often to support splitting the file. See: https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F

(1) Why isn't the synchronization marker the same for every container file? (i.e. what is the point of generating it randomly every time)

(2) Is it possible, at least in theory, for naturally occurring data to contain bytes that match the sync marker? If so, would this break synchronization?

Thanks, Josh

  1. Because if it was predictable, it would inevitably appear in the actual data sometimes (e.g. imagine the Avro documentation, stating what the sync marker is, is downloaded by a web crawler and stored in an Avro data file; then the sync marker will appear in the actual data). Data may come from malicious sources; making the marker random makes it unfeasible to exploit.

  2. Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data is about 10^-23. It's more likely that your data center is wiped out by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).

  3. If the sync marker appears in your data, it only breaks reading the file if you happen to also seek to that place in the file. If you just read over it sequentially, nothing happens.

Martin

Link to the Avro mailing list archive
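Martin's 10^-23 figure is easy to sanity-check: a petabyte has about 10^15 byte offsets, and a given 16-byte value matches at any one offset with probability 2^-128, so a union bound gives the order of magnitude he quotes.

```python
positions = 10**15        # byte offsets in ~1 PB of data
p_match = 2.0 ** -128     # a specific 16-byte marker matching at one offset
expected_matches = positions * p_match  # union bound on the probability
assert 1e-24 < expected_matches < 1e-23  # ~3e-24, i.e. about 10^-23
```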

If it works for Avro, it will work for you too.

Upvotes: 2
