Reputation: 10863
I'm working on a scientific application in C that sends instructions telling a device to perform an experiment, then reads out the data from that experiment, and I'm specifically automating that process so that it can largely run without my input.
I'm storing the pulse programs and the data files as ASCII files at the moment, but there are two issues with this. The first is speed: when I load the data files into MATLAB matrices, save them from MATLAB, and then read them in again later, it takes 100x longer to read from the ASCII files than from the .mat files - this jibes with my previous experience that these ASCII read/write operations are the slowest part of my program. The second is versatility: any time I try to extend the program, I have to create a new format specification for the storage of the files, which is annoying.
I'm thinking about finding an SQL library and storing everything as relational databases, but from what I know about databases, you aren't meant to create a large number of small databases (data files are between maybe 50k and 50M, program files are very small). I'm really looking for something like the Matlab save() function, where I can dump a struct() variable into a file, then read it out by name. A database would make that easy enough, but it seems like there must be a more tailored way to store files that way. Any suggestions?
Edit: Sorry for the vague language here. I was trying not to get bogged down in specifics so that the question could have broader applicability, but I see the folly in that now. Here's exactly what I do:
Starting from a pulse program that is saved in a file, here are the steps I'm doing:
1.) Read the pulse program from an ASCII file that looks like this:
#ValidPulseProgram#
NInstructions= 15
NTransients= 1
DelayTime= 0.000000
TriggerTTL= 0
NPoints= 2000
SamplingRate= 2000.000000
PhaseCycle= 0
NumCycles= 4
CycleInstr= 2
CycleFreq= 60.000000
Dimensions= 2
[Point]
IndirectDim 1 - 1 of 51
[Instructions]
Instruction 0 0 4 0 0 10.000000 1000000000.000000
...
Instruction 2 0 266 0 0 750.000000 1000000.000000
Instruction 14 0 4 1 0 100.000000 1000000.000000
[EndInstructions]
nVaried= 1
VaryInstr 0 5 0 -1.000000 24 -1.000000 1200 -1.000000 1
This is read out line-by-line and put into a struct that contains all the information.
2.) Send the struct to a program which translates it into something readable by the device which sets up the timings and such and starts the experiment.
3.) Data comes in and is stored in ASCII files, each of which has a header containing the program that was used and some other information about the acquisition. Each acquisition is stored as a separate ASCII file. There are sometimes thousands of these files, organized in a folder hierarchy.
4.) Later on, I want to be able to read out from the ASCII files. I either read them out from this C program I'm working on, or I read them out using a Matlab script that reads them into matlab variables (for more intense data analysis). The last step in that script is a call to save() which saves the .mat matlab file. For acquisitions with say 50 files of 1000 points each, it takes around 10-30 seconds to get all the data out into an array. If I save that array to a .mat file and later read it into the workspace, it takes milliseconds.
So the two problems are step 4.), where I should be saving these immediately in a format I can read back in milliseconds, since it shouldn't take 30 s to read a few MB from file, and step 1.), where I'd like to replace that ASCII file with something like a binary file containing a struct.
Upvotes: 1
Views: 275
Reputation: 93476
MATLAB has a C/C++ and Fortran API library that includes a MAT-File Library. That would be the most obvious solution.
When reading an ASCII file, it is likely that MATLAB adds each value to the matrix variable without a priori knowledge of the ultimate size, so it must constantly allocate, reallocate, and move data in memory as the matrix grows - for large data sets this will often involve virtual-memory disk swapping, and can be very slow. Either way it is both slow and non-deterministic. When a .MAT file is read, MATLAB allocates the correct size once and loads the data in one go.
Upvotes: 4
Reputation: 42277
HDF5 is a library/file format designed as a database for scientific data. It is slightly more complex than just dumping to ASCII, but it is optimized for speed and has bindings for quite a lot of languages (C, Fortran, Python; it seems MATLAB has built-in support too).
I don't know if HDF5 is common in your domain, but it seems to me it's better suited than an SQL database. SQL provides the ability to run complex queries, which might be unnecessary for you.
Upvotes: 3
Reputation: 16677
You would not create one database per file. You would instead create one table that could hold the files, and insert each file as a record.
Alternately, you could build a proper relational structure that can be used to re-constitute each file from its various parts. That way you would presumably be able to tweak the file contents as data and then be good to go (as opposed to having to know the file structure and edit it as a whole).
Also, you may consider an XML structure instead of plain ASCII. You could then take advantage of existing parsing tools to get at the juicy bits in the file; efficiency is not too bad there.
Upvotes: 0