Reputation: 121
Assume I have a sensor returing measurement data (e.g. 6 x 50 values per second). Each value is a decimal within the range of single-precision float. I need to write these data to a file which is then read by another application for other operations. What would be the most-efficient way to format/encode the file?
At first I thought CSV for simplicity, but then using scientific notation would result in each decimal having length 9 bytes or more (e.g. -4,97E-03). This can be a problem for storage constraint in case of long data retention over lots of sensors, also because these data have high entropy, so compression doesn't help much.
So I was considering that saving decimals as float (4 bytes) would save a lot of data, but I don't know which formats provide well-defined structures for storing a table of floats. Is there something like a comma-separated-values where values are in IEEE754 format, or something similar? I'm asking this because I'd like to avoid defining a custom format.
Upvotes: 1
Views: 52
Reputation: 3256
As you said, encoding floats in a human-readable text format like CSV is pretty space inefficient, since each 32-bit float costs a dozen or so characters to encode. As a test, I generated 1 million random 32-bit floats and saved them as a text file:
-5.92667373e+04
-1.10473797e+05
7.58996562e+04
3.52729886e+04
...
The size of this file is 15,499,059 bytes. However, such a text file compresses very well! After running the file through gzip, the file size reduces to 5,925,628 bytes. This is not too bad, about 1.5x the size it would cost to store the floats as binary data (4,000,000 bytes).
32-bit floats represent roughly 7 significant digits of precision, but this might be excessive to represent the measurements, particularly if the measurements are known to be less precise than this. With a text format, you can save space by printing fewer significant digits. Or if writing binary data, you can reduce to half the cost by rounding down to 16-bit half-precision floats or a 16-bit fixed-point representation.
As Eric commented, you could write the raw binary data directly, along with array dimensions or whatever else you need, and come up with your own ad hoc format. But if you would rather use an existing standard format, here are a couple suggestions:
The NPY format is natively supported by the Python numpy library, and can represent arrays of floats or just about anything else you can put in a numpy array. If you are working in Python already, NPY files are easily read and written with np.load and np.save. There are also implementations of NPY in other languages, for instance https://github.com/rogersce/cnpy in C++ and https://docs.rs/npy/0.4.0/npy in Rust.
The FITS format is widely used in astronomy. FITS stores arrays of float data of any size and number of dimensions using a simple binary encoding, and it can store arbitrary metadata in a text header field. The format is intentionally simple, so implementing your own reader and writer is relatively straightforward.
Upvotes: 0