David

Reputation:

What is the best compression algorithm for small 4 KB files?

I am trying to compress TCP packets, each about 4 KB in size. The packets can contain any byte value (0 to 255). All of the benchmarks on compression algorithms that I found were based on larger files. I did not find anything that compares the compression ratio of different algorithms on small files, which is what I need. It needs to be open source so it can be implemented in C++, so no RAR, for example. What algorithm can be recommended for small files of about 4 kilobytes? LZMA? HACC? ZIP? gzip? bzip2?

Upvotes: 21

Views: 15841

Answers (11)

Yoon

Reputation: 11

Use zstd as a library; it can give you a higher compression ratio and better speed, especially for small data: https://github.com/facebook/zstd. From the zstd README:

The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms, and reason is, compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no "past" to build upon.

To solve this situation, Zstd offers a training mode, which can be used to tune the algorithm for a selected type of data. Training Zstandard is achieved by providing it with a few samples (one file per sample). The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression. Using this dictionary, the compression ratio achievable on small data improves dramatically.

For supporting data, check the sample results in the GitHub repo. I also tested this myself on my own data: with a dictionary, compression improves a lot when the data is around 1 KB, and the advantage over dictionary-less compression shrinks as the data gets bigger. See github.com/facebook/zstd/issues/3603.

Upvotes: -1

Matt Simerson

Reputation: 1125

[Scatter plot: compression of ASCII messages]

This is a follow-up to Rick's excellent answer which I've upvoted. Unfortunately, I couldn't include an image in a comment.

I ran across this question and decided to try deflate on a sample of 500 ASCII messages that ranged in size from 6 to 340 bytes. Each message is a bit of data generated by an environmental monitoring system that gets transported via an expensive (pay-per-byte) satellite link.

The most fun observation is that the crossover point at which messages are smaller after compression is the same as the Ultimate Question of Life, the Universe, and Everything: 42 bytes.

To try this out on your own data, here's a little bit of node.js to help:

// Requires only Node's built-in zlib; data_packet is your message as a Buffer.
const zlib = require('zlib')

const data_packet = Buffer.from('your message bytes here') // substitute your real packet
const inflate_len = data_packet.length
const deflate_len = zlib.deflateRawSync(data_packet).length
const delta = +((inflate_len - deflate_len) / -inflate_len * 100).toFixed(0)

console.log('inflated,deflated,delta(%)')
// padStart replaces the original sprintf-js dependency
console.log(`${String(inflate_len).padStart(3, '0')},${String(deflate_len).padStart(3, '0')},${delta}`)

Upvotes: 8

realdar

Reputation: 9

You could try bicom, but note that it is forbidden for commercial use. If you want something for professional or commercial use, look at range coding algorithms.

Upvotes: 0

KRoy

Reputation: 1406

You can try delta compression. How much you gain will depend on your data. If there is any encapsulation on the payload, you can compress the headers.

Upvotes: -1

David Cary

Reputation: 5520

If you want to compress TCP packets, you might consider using an RFC-standard technique:

  • RFC1978 PPP Predictor Compression Protocol
  • RFC2394 IP Payload Compression Using DEFLATE
  • RFC2395 IP Payload Compression Using LZS
  • RFC3173 IP Payload Compression Protocol (IPComp)
  • RFC3051 IP Payload Compression Using ITU-T V.44 Packet Method
  • RFC5172 Negotiation for IPv6 Datagram Compression Using IPv6 Control Protocol
  • RFC5112 The Presence-Specific Static Dictionary for Signaling Compression (Sigcomp)
  • RFC3284 The VCDIFF Generic Differencing and Compression Data Format
  • RFC2118 Microsoft Point-To-Point Compression (MPPC) Protocol

There are probably other relevant RFCs I've overlooked.

Upvotes: 6

Rick C. Petty

Reputation: 951

Choose the algorithm that is the quickest, since you probably care about doing this in real time. Generally, for smaller blocks of data the algorithms compress about the same (give or take a few bytes), mostly because they need to transmit a dictionary or Huffman trees in addition to the payload.

I highly recommend Deflate (used by zlib and Zip) for a number of reasons. The algorithm is quite fast, well tested, liberally licensed (the zlib license), and is the only compression method required to be supported by Zip (per the PKWARE APPNOTE). Aside from the basics, when it determines that the compressed output would be larger than the input, Deflate has a STORE mode which adds only 5 bytes per block of data (the maximum block is 64 KB).

Aside from the STORE mode, Deflate supports two different types of Huffman tables: dynamic and fixed. A dynamic table means the Huffman tree is transmitted as part of the compressed data, which is the most flexible choice for varying types of non-random data. The advantage of a fixed table is that the table is known by all decoders and thus doesn't need to be contained in the compressed stream. The decompression (Inflate) code is relatively easy. I've written both Java and JavaScript versions based directly on zlib, and they perform rather well.

The other compression algorithms mentioned have their merits. I prefer Deflate because of its runtime performance, on both the compression side and, particularly, the decompression side.

A point of clarification: Zip is not a compression type, it is a container. For doing packet compression, I would bypass Zip and just use the deflate/inflate APIs provided by zlib.

Upvotes: 18

MadCoder

Reputation: 1352

I've had luck using the zlib compression libraries directly rather than any file container. ZIP and RAR have overhead to store things like filenames. I've seen compression this way yield positive results (output smaller than the original) for packets down to 200 bytes.

Upvotes: 0

David

Reputation:

I did what Arno Setagaya suggested in his answer: made some sample tests and compared the results.

The compression tests were done using 5 files, each of them 4096 bytes in size. Each byte inside of these 5 files was generated randomly.

IMPORTANT: In real life, the data would not likely be all random; it would tend to have quite a bit of repeating bytes. Thus in a real-life application the compression would tend to be a bit better than the following results.

NOTE: Each of the 5 files was compressed by itself (i.e. not together with the other 4 files, which would result in better compression). In the following results I just use the sum of the size of the 5 files together for simplicity.

I included RAR just for comparison reasons, even though it is not open source.

Results: (from best to worst)

LZOP: 20775 / 20480 * 100 = 101.44% of original size

RAR : 20825 / 20480 * 100 = 101.68% of original size

LZMA: 20827 / 20480 * 100 = 101.69% of original size

ZIP : 21020 / 20480 * 100 = 102.64% of original size

BZIP: 22899 / 20480 * 100 = 111.81% of original size

Conclusion: To my surprise, ALL of the tested algorithms produced a larger size than the originals! I guess they are only good for compressing larger files, or files that have a lot of repeating bytes (not random data like the above). Thus I will not be using any type of compression on my TCP packets. Maybe this information will be useful to others who consider compressing small pieces of data.

EDIT: I forgot to mention that I used default options (flags) for each of the algorithms.

Upvotes: -5

Unknown

Reputation: 46841

ZLIB should be fine. It is used in MCCP.

However, if you really need good compression, I would do an analysis of common patterns and include a dictionary of them in the client, which can yield even higher levels of compression.

Upvotes: 1

Mark Ransom

Reputation: 308538

I don't think the file size matters - if I remember correctly, the LZW in GIF resets its dictionary every 4K.

Upvotes: 1

SPWorley

Reputation: 11878

All of those algorithms are reasonable to try. As you say, they aren't optimized for tiny files, but your next step is to simply try them. It will likely take only 10 minutes to test-compress some typical packets and see what sizes result. (Try different compress flags too). From the resulting files you can likely pick out which tool works best.

The candidates you listed are all good first tries. You might also try bzip2.

Sometimes a simple "try them all" approach is a good solution when the tests are easy to run; thinking too much can sometimes slow you down.

Upvotes: 2
