Spifff
Spifff

Reputation: 748

Storing integer arrays in a MySQL database

I have a lot of spectra that I want to store in a database. A spectrum is basically an array of integers with in my case a variable length of typically 512 or 1024. How best to store these spectra? Along with the spectra I want to store some additional data like time and a label, which will be simple fields in my database. The spectra will not be retrieved often and if I need them, I need them as a whole.

For storing the spectra I can think of 2 possible solutions:

Any suggestions on which one to use? Other solutions are much appreciated of course!

Upvotes: 3

Views: 8776

Answers (2)

Fábio Lobão
Fábio Lobão

Reputation: 301

If you have a small dataset (hundreds of MB), there is no problem in using an SQL DBMS with any of the alternatives.

As proposed by Maciej, serialization is an improvement over the other alternative, such as you can group each spectrum sweep into a single tuple (row in a table), reducing the overhead in keys and other information.

For the serialization, you may consider using objects such as linestring or multipoint in order to be able to better process the data using SQL functions. This will require some scaling but will allow querying the data and if you use WKB you may also achieve a relevant gain in storage use with little loss in the performance.

The problem is that spectrum data tends to accumulate and storage usage may become a problem that will not be easily solved by the serialization trick. You should carefully consider this in your project.

Working on a similar problem, I came to the conclusion that it is a bad idea to use any SQL DMBS (MySQL, SQL Server, Postgre, and such) to manage large numerical matrix data, such as spectrum sweep measurements. It's a bit like trying to create an image library CMS by storing images pixel by pixel into a database.

The following table presents a comparison between a few formats in my experiment. This may help to understand the problem of using SQL DBMS to store numerical data matrixes.


    MySQL Table​      Table with key - Int(10) - and value - decimal(4,1)    1 157 627 904 B​
    TXT              CSV decimal(4,1), equivalent to 14bit                    276 895 606 B​
    BIN (original)   Matrix 1 byte x 51200 columns x 773 rows + Metadata       40 038 580 B
    HDF5             Matrix 3 bytes x 51200 columns x 773 rows + Metadata      35 192 973 B
    TXT + Zip​        CSV decimal (4,1) + standard zip compression              34 175 971 B
    PNGRGBa​          Matrix 4 bytes x 51200 columns x 773 rows                 33 997 095 B
    ZIP(BIN)​         Original BIN file compressed with standard zip            26 028 780 B
    PNG 8bIndexed​    Matrix 1 byte x 51200 columns x 773 rows + Color scale    25 947 324 B

The example using MySQL didn't use any serialization. I didn't try it but one may expect a reduction to almost half the size of the occupied storage by using WKT linestrings or similar features. Even so, the storage used would be almost the double of the corresponding CSV and more than 20 times the size of a PNG8b with the same data.

These numbers are expected when you stop to think about how much extra data you are storing in terms of keys and search optimization when you use an SQL DBMS.

For closing remarks, I would suggest that you consider the use of PNG, TIFF, HDF5, or any other digital format that is more suitable to construct your front-end to store the spectrum data (or any other large matrix) and maybe using an SQL DBMS for the dimensions around this core data, such as who measures, when, with which equipment, to which end, etc. In short, have a BLOB within the database with the files or outside, as it better suits your system architecture.

Alternatively, is worthwhile to consider using a big data solution around some digital format such as HDF5. Each tool to its end.

Upvotes: 0

Code Different
Code Different

Reputation: 93151

Your first solution is a common mistake when people transition from the procedural/OO programming mindset to the database mindset. It's all about efficiency, least number of records to fetch etc. The database world requires a different paradigm to store and retrieve data.

Here's how I'd do it: make 2 tables:

spectra
---------
spectra_id (primary key)
label
time

spectra_detail
---------
spectra_id
index
value

To retrieve them:

SELECT     *
FROM       spectra s
INNER JOIN spectra_detail sd ON s.spectra_id = sd.spectra_id
WHERE      s.spectra_id = 42

Upvotes: 4

Related Questions