user6596353

Reputation: 65

Compressing a list of integers in Python

I have a list of positive (random) integers with the following properties:

Number of elements: 78495

Maximum value of element: 999982

Length of list when converted to a string: 517115 (string looks like "6,79384,238956,...")

Size of list in text file on disk: 520 KB

I am trying to use this as a precomputed list for an online judge problem, because generating it takes a long time. However, it is too large to be accepted if I paste it directly into the source code, which is capped at 50 KB.

I looked into zlib as a way to compress the string but it only seemed to cut the size in half.

Is there a way to really shrink this down so I can unpack it / use it in the source code?
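For reference, my zlib attempt was roughly like this (a sketch; the file name is a placeholder for wherever the list actually lives):

import zlib

# Read the comma-separated list and compress the raw text.
with open("values.txt") as f:
    raw = f.read().encode("ascii")

compressed = zlib.compress(raw, 9)  # level 9 = best compression
print(len(compressed) / len(raw))   # comes out to roughly 0.5 for this data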

Upvotes: 2

Views: 12318

Answers (5)

Marco

Reputation: 9097

There are Python bindings for Streamvbyte.

Streamvbyte is a fast, fairly good integer compression library, used by a number of well-known open source projects.

Upvotes: 0

Pavel Tolmachev

Reputation: 61

Here I've tested the easily available compression algorithms in Python on two byte strings: one generated randomly with a non-uniform distribution, the other with some structure. It seems lzma does best:

# Check the compression ratios of the standard-library compressors
import lzma
import zlib
import gzip
import bz2
import numpy as np

compressors = {'lzma': lzma, 'zlib': zlib, 'gzip': gzip, 'bz2': bz2}

a = np.exp(np.random.rand(1024))  # random floats, non-uniform distribution
b = np.arange(1024)               # mostly regular integers ...
b[32] = -10                       # ... with a couple of outliers
b[96] = 20000
a = bytes(a)
b = bytes(b)

for name, module in compressors.items():
    a_c = module.compress(a)
    b_c = module.compress(b)
    print("{} compression ratio:".format(name))
    print(len(a_c) / len(a), len(b_c) / len(b))
    print()

The output:

lzma compression ratio:
0.93115234375 0.08984375

zlib compression ratio:
0.95068359375 0.1944580078125

gzip compression ratio:
0.9521484375 0.196533203125

bz2 compression ratio:
0.9925537109375 0.1268310546875
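Out of curiosity, here is the same idea applied to a comma-joined string of pseudo-random integers shaped like the OP's data (a sketch; I don't have the real list, so the exact ratio will differ):

import lzma
import random

nums = [random.randrange(1, 999983) for _ in range(78495)]
s = ",".join(map(str, nums)).encode("ascii")
# Near-random digit text can't compress much below its character entropy,
# so expect a ratio in the neighborhood of one half.
print(len(lzma.compress(s)) / len(s))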

Upvotes: 1

Arnauld

Reputation: 6110

Given your definition ...

it is a list of smallest k values for which 10^k ≡ 1 (mod p), for primes p > 5

... am I wrong to believe that your values are of the form (p - 1) / x where x is an integer significantly smaller than p?

For instance, for p < 50, we have:

p = 7  : 10^6  = 1 (mod 7)  => k = 6  = (p - 1) / 1  => x = 1
p = 11 : 10^2  = 1 (mod 11) => k = 2  = (p - 1) / 5  => x = 5
p = 13 : 10^6  = 1 (mod 13) => k = 6  = (p - 1) / 2  => x = 2
p = 17 : 10^16 = 1 (mod 17) => k = 16 = (p - 1) / 1  => x = 1
p = 19 : 10^18 = 1 (mod 19) => k = 18 = (p - 1) / 1  => x = 1
p = 23 : 10^22 = 1 (mod 23) => k = 22 = (p - 1) / 1  => x = 1
p = 29 : 10^28 = 1 (mod 29) => k = 28 = (p - 1) / 1  => x = 1
p = 31 : 10^15 = 1 (mod 31) => k = 15 = (p - 1) / 2  => x = 2
p = 37 : 10^3  = 1 (mod 37) => k = 3  = (p - 1) / 12 => x = 12
p = 41 : 10^5  = 1 (mod 41) => k = 5  = (p - 1) / 8  => x = 8
p = 43 : 10^21 = 1 (mod 43) => k = 21 = (p - 1) / 2  => x = 2
p = 47 : 10^46 = 1 (mod 47) => k = 46 = (p - 1) / 1  => x = 1

The list of x values should compress much better than the list of k values. (For instance, I'd be willing to bet that the most frequent value of x will be '1'.)

And because it's rather easy and fast to compute primes up to 1 million (which I think is your upper bound), you may be able to quickly rebuild the list of k values based on the compressed list of x values and the real-time computed list of primes.
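Here is a minimal sketch of that rebuild (the helper names are mine; it assumes each k is exactly the multiplicative order of 10 mod p, per the quoted definition, and it is not optimized for speed):

def primes_up_to(limit):
    # Simple sieve of Eratosthenes.
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]

def order10(p):
    # Smallest k with 10**k = 1 (mod p); k always divides p - 1,
    # so testing the divisors of p - 1 in increasing order finds it.
    n = p - 1
    divisors = sorted({d for i in range(1, int(n ** 0.5) + 1)
                       if n % i == 0 for d in (i, n // i)})
    for d in divisors:
        if pow(10, d, p) == 1:
            return d

primes = [p for p in primes_up_to(10 ** 6) if p > 5]
xs = [(p - 1) // order10(p) for p in primes]     # compress this list instead
ks = [(p - 1) // x for p, x in zip(primes, xs)]  # exact reconstruction of k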

You probably should have explained from the beginning exactly what you were trying to compress; you would have gotten more accurate answers.

Upvotes: 5

Alex Reynolds

Reputation: 96937

The best text compression available offers a compression ratio of roughly 12-17% (62.4-90 kB here), so you're not going to meet your 50 kB threshold. Your data are also random, which generally makes compression algorithms perform worse.

Look at an alternative approach instead, such as making your generation process faster. Or, if you don't need the full list at once (just some integers), create a separate "producer" thread that generates integers (doing whatever actual math you are using) and a "consumer" thread that does work on those integers as they come in. That way your program can still make progress, even if generating the full list would take a long time. A rough sketch follows.
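A minimal producer/consumer sketch (compute_next_value and do_work are hypothetical stand-ins for the OP's actual computation):

import queue
import threading

q = queue.Queue(maxsize=1000)  # bounded, so the producer can't run far ahead

def producer():
    for _ in range(78495):
        q.put(compute_next_value())  # hypothetical slow computation
    q.put(None)                      # sentinel: no more values

def consumer():
    while True:
        value = q.get()
        if value is None:
            break
        do_work(value)               # hypothetical per-integer work

threading.Thread(target=producer, daemon=True).start()
consumer()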

Upvotes: 1

user94559

Reputation: 60143

In short, no.

log2(999982) ≈ 19.9

So the largest number would take 20 bits to store. Let's optimistically say that each number takes 10 bits to store on average (this assumes most values are far smaller than the maximum):

~80,000 numbers * 10 bits per number = 800,000 bits = 100,000 bytes

So these numbers, stored as efficiently as possible, would take ~100KB of space.

Compression will only work if there's some non-randomness to the numbers. If they're truly random, as you say, then a general compression algorithm won't be able to make this any smaller, so 100KB is about the best you can hope to do.

EDIT

Note that things are even worse, because you want to paste these numbers into source code, so you can't just use arbitrary binary data. You'll need something text-friendly, like base64 encoding, which adds another ~33% of overhead. Also, you can't really store the numbers using the average number of bits, because you'd need some way to know how many bits each individual number used. Variable-length encoding schemes exist, but all carry some additional overhead.
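For the embedding step itself, here is a rough sketch of the round trip using zlib plus base85 (which carries less overhead than base64: 25% vs. ~33%); the three-element list is a stand-in for the real data:

import base64
import zlib

nums = [6, 79384, 238956]  # stand-in for the real 78495-element list

# Encode: comma-join, compress, then wrap in ASCII for pasting into source.
blob = zlib.compress(",".join(map(str, nums)).encode("ascii"))
payload = base64.b85encode(blob).decode("ascii")

# Decode at runtime: reverse the steps.
text = zlib.decompress(base64.b85decode(payload)).decode("ascii")
restored = [int(s) for s in text.split(",")]
assert restored == nums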

SECOND EDIT

Based on the comments above, the data is not actually random as originally stated. A general compression algorithm therefore might work, and if not, there are presumably other solutions (e.g. just shipping the code that generated the numbers in the first place, which is likely smaller than 50KB).

Upvotes: 2
