Reputation: 3919
I have many large 1GB+ matrices of doubles
(floats), many of them 0.0
, that need to be stored efficiently. I indend on keeping the double
type since some of the elements do require to be a double
(but I can consider changing this if it could lead to a significant space saving). A string header is optional. The matrices have no missing elements, NaNs, NAs, nulls, etc: they are all doubles
.
Some columns will be sparse, others will not be. The proportion of columns that are sparse will vary from file to file.
What is a space efficient alternative to CSV? For my use, I need to parse this matrix quickly into R
, python
and Java
, so a file format specific to a single language is not appropriate. Access may need to be by row or column.
I am also not looking for a commercial solution.
My main objective is to save HDD space without blowing out io
times. RAM usage once imported is not the primary consideration.
Upvotes: 1
Views: 941
Reputation: 46492
The most important question is if you always expand the whole matrix into memory or if you need a random access to the compacted form (and how). Expanding is way simpler, so I'm concentrating on this.
You could use a bitmap stating if a number is present or zero. This costs 1 bit per entry and thus can increase the file size by 1/64
in case of no zeros or shrink it to 1/64
in case of all zeros. If there are runs of zeros, you may store the number of following zeros and the number non-zeros instead, e.g., by packing two 4-bit numbers into one byte.
As the double
representation is standard, you can use binary representation in both languages. If many of your numbers are actually int
s, you may consider something like I did.
If consecutive numbers are related, you could consider storing their differences.
I indend on keeping the double type since some of the elements do require to be a double (but I can consider changing this if it could lead to a significant space saving).
Obviously, switching to float
would trade half precision for haltf the memory. This is probably too imprecise, so you could instead omit a few bits from the mantisa and get e.g. 6 bytes per entry. Alternatively, you could reduce the exponent to a single byte instead as the range 1e-38 to 3e38 should suffice.
Upvotes: 1