Adam G

Reputation: 77

Java Deflater for large set of random strings

I am using the Deflater class to try to compress a large set of random strings. My compression and decompression methods look like this:

public static String compressAndEncodeBase64(String text) {
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            // Use an explicit charset so compression and decompression agree.
            dos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        byte[] bytes = os.toByteArray();

        return Base64.getEncoder().encodeToString(bytes);
    } catch (Exception e) {
        log.info("Caught exception when trying to compress {}: ", text, e);
    }
    return null;
}

public static String decompressB64(String compressedAndEncodedText) {
    try {
        byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(decodedText);
        }
        byte[] decompressedBArray = os.toByteArray();
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    } catch (Exception e){
        log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
        throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
    }
}

However, when I test this on a large set of random strings, my "compressed" string is larger than the original string. Even for a relatively small random string, the compressed data is longer. For example, this unit test fails:

@Test
public void testCompressDecompressRandomString() {
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue(orig.length() - compressedString.length() > 0,
            "The original string has length " + orig.length()
                    + ", while the compressed string has length " + compressedString.length());
}

Can anyone explain what's going on, and suggest a possible alternative?

Note: I tried using the Deflater without the Base64 encoding:

public static String compress(String data) {
    Deflater deflater = new Deflater();
    deflater.setInput(data.getBytes(StandardCharsets.UTF_8));
    deflater.finish();
    byte[] compressed = new byte[1024];
    int compressedSize = deflater.deflate(compressed);
    byte[] returnValues = new byte[compressedSize];
    System.arraycopy(compressed, 0, returnValues, 0, compressedSize);
    log.info("The Original String: " + data + "\n Size: " + data.length());
    log.info("The Compressed String Output: " + new String(compressed) + "\n Size: " + compressedSize);
    return new String(returnValues, StandardCharsets.UTF_8);
}

My test still fails, however.

Upvotes: 0

Views: 349

Answers (1)

Mark Adler

Reputation: 112502

First off, you aren't going to get much or any compression on short strings. Compressors need more data to both collect statistics on the data and to have previous data in which to look for repeated strings.
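
A rough sketch of that effect (the class name here is just for illustration; exact byte counts depend on the zlib version behind the JDK) shows that deflating even a tiny string produces output longer than the input, because the fixed header, block, and checksum overhead dominates:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class ShortStringOverhead {
    public static void main(String[] args) throws Exception {
        byte[] input = "hello".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(input);
        }

        // For a 5-byte input, the zlib header, block header, and Adler-32
        // trailer alone make the output longer than the input.
        System.out.println("input bytes:      " + input.length);
        System.out.println("compressed bytes: " + os.toByteArray().length);
    }
}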

Second, if you're testing with random data, you are further crippling the compressor, since now there are no repeated strings. For your test case with random alphanumeric strings, the only compression you can get is to take advantage of the fact that there are only 62 possible values for each byte. That can be compressed by a factor of log(62)/log(256) = 0.744. Even then, you need to have enough input to cancel the overhead of the code description. Your test case of 71 characters will always be compressed to 73 bytes by deflate, which is essentially just copying the data with a small overhead. There isn't enough input to justify the code description to take advantage of the limited character set. If I have 1,000,000 random characters from that set of 62, then deflate can compress that to about 752,000 bytes.
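
To see those numbers concretely, a sketch along these lines (reusing RandomStringUtils from the test above; the result is approximate and varies slightly with the zlib version) compresses one million random alphanumeric characters to roughly 752,000 bytes, close to the log(62)/log(256) bound:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

import org.apache.commons.lang3.RandomStringUtils;

public class AlphanumericBound {
    public static void main(String[] args) throws Exception {
        // One million random characters drawn from the 62 letters and digits.
        byte[] input = RandomStringUtils.random(1_000_000, true, true)
                .getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(input);
        }

        double bound = Math.log(62) / Math.log(256);   // ~0.744 output bytes per input byte
        System.out.printf("entropy bound:  %.0f bytes%n", input.length * bound);
        System.out.printf("deflate output: %d bytes%n", os.toByteArray().length);  // ~752,000
    }
}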

Third, you are then expanding the resulting compressed data by a factor of 1.333 by encoding it using Base64. So if I take that compression by a factor of 0.752 and then expand it by 1.333, I get an overall expansion of 1.002! You won't get anywhere that way on random characters from a set of 62, no matter how long the input is.
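
That 1.333 factor is just Base64's four output characters for every three input bytes; a quick sketch of the arithmetic:

import java.util.Base64;

public class Base64Expansion {
    public static void main(String[] args) {
        byte[] compressed = new byte[752_000];   // stand-in for the deflate output above

        int encodedLength = Base64.getEncoder().encodeToString(compressed).length();

        // Base64 emits 4 characters for every 3 input bytes (rounded up), so
        // 752,000 bytes become 1,002,668 characters: a net expansion of ~1.002
        // relative to the original 1,000,000 characters.
        System.out.println("encoded length: " + encodedLength);
        System.out.println("expansion over compressed bytes: "
                + (double) encodedLength / compressed.length);
    }
}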

Given all that, you need to do your testing on real-world inputs. I suspect that your application does not have randomly-generated data. Don't attempt compression on short strings. Combine your strings into much longer input, so that the compressor has something to work with. If you must encode with Base64, then you must. But expect that there may be expansion instead of compression. You could include in your output format an option for chunks to be compressed or not compressed, indicated by a leading byte. Then when compressing, if it doesn't compress, send it without compression instead. You can also try a more efficient encoding, e.g. Base85, or whatever number of characters you can transmit transparently.
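
A minimal sketch of that flag-byte scheme (the names wrap and unwrap and the single-byte flag values are illustrative choices, not something prescribed above):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterOutputStream;

public class MaybeCompress {
    private static final byte RAW = 0;        // payload is stored as-is
    private static final byte DEFLATED = 1;   // payload is deflate-compressed

    /** Compress if it helps; otherwise store the input behind a RAW flag byte. */
    public static byte[] wrap(byte[] input) throws Exception {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(input);
        }
        byte[] deflated = os.toByteArray();

        byte[] out;
        if (deflated.length < input.length) {
            out = new byte[deflated.length + 1];
            out[0] = DEFLATED;
            System.arraycopy(deflated, 0, out, 1, deflated.length);
        } else {
            out = new byte[input.length + 1];
            out[0] = RAW;
            System.arraycopy(input, 0, out, 1, input.length);
        }
        return out;
    }

    /** Reverse wrap(): inspect the leading byte and inflate only if needed. */
    public static byte[] unwrap(byte[] wrapped) throws Exception {
        byte[] payload = new byte[wrapped.length - 1];
        System.arraycopy(wrapped, 1, payload, 0, payload.length);
        if (wrapped[0] == RAW) {
            return payload;
        }
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(payload);
        }
        return os.toByteArray();
    }
}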

Upvotes: 2
