Reputation: 1286
I am struggling with the processing of binary (file) data in C#. This is the situation:
I have a binary file that can be as small as 1 MB or as large as 60 GB, hence impossible to fit in memory (assume a slow laptop with 2 GB of RAM, running either 32- or 64-bit Windows). This file contains data from, for example, 20 sources along a time base. The file's header does not tell me the length of the signals, meaning that the length of each signal can (and most of the time will) differ. Therefore I do not know beforehand how many bytes one signal contains. Also note that the data is unevenly spaced throughout the file, so I have to search the file for 2-byte identifiers that match the corresponding signal sample.
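To make the layout concrete, here is a minimal sketch of the kind of sequential scan I have in mind. It assumes, purely hypothetically, that each 2-byte identifier is followed by a fixed-size payload (`SampleSize`); in my real file the payloads vary, so this is just an illustration of the chunked-read idea:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class IdentifierScanner
{
    // Hypothetical layout: each record is a 2-byte identifier followed by a
    // fixed-size payload. The real format has variable-length signals.
    const int SampleSize = 8;

    public static Dictionary<ushort, long> CountSamples(string path)
    {
        var counts = new Dictionary<ushort, long>();
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 1 << 16)) // 64 KB read buffer
        using (var reader = new BinaryReader(stream))
        {
            long recordLen = 2 + SampleSize;
            while (stream.Position + recordLen <= stream.Length)
            {
                ushort id = reader.ReadUInt16();             // 2-byte signal identifier
                stream.Seek(SampleSize, SeekOrigin.Current); // skip the payload
                counts.TryGetValue(id, out long n);
                counts[id] = n + 1;
            }
        }
        return counts;
    }
}
```

Even for a 60 GB file this stays within a fixed memory budget, because only one buffered chunk is ever resident at a time.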
Secondly, I need to process this data and store it in a new binary file. The file size will be roughly the same, but the binary format is completely different: in fact, it is the Matlab binary file format.
These are the challenges:
I already tried memory-mapping the input file, but I am stuck with this approach, since I would still need to search through the complete file to determine the length of each signal.
What would be a good approach to accomplish the above?
Upvotes: 1
Views: 257
Reputation: 13545
How about a signal splitter? I would scan through the file once and create a new file for each new signal. While reading the huge file, I would write the data to the appropriate signal file. Memory is a non-issue, since you are only ever reading one chunk from disk at a time (about 4 KB), which does not get you anywhere near any memory limit.
If you need some correlation between the signals, you would need to insert time-point markers into the split files so you can analyze them on a common time base. That should also be rather easy to do.
As an added bonus, you then know each signal's length simply by reading its file's length; or, if you do time-based processing, I would write a header containing the final signal length found during the splitting process.
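A minimal sketch of that single-pass splitter, under the same simplifying assumption of a fixed `SampleSize` payload per identifier (the per-signal file naming `signal_{id}.bin` is also just an illustration):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SignalSplitter
{
    // Hypothetical fixed payload size per sample; adapt to the real format.
    const int SampleSize = 8;

    // One sequential pass: route each sample to a per-signal file keyed by id.
    public static void Split(string inputPath, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        var writers = new Dictionary<ushort, BinaryWriter>();
        try
        {
            using (var reader = new BinaryReader(File.OpenRead(inputPath)))
            {
                var stream = reader.BaseStream;
                while (stream.Position + 2 + SampleSize <= stream.Length)
                {
                    ushort id = reader.ReadUInt16();
                    byte[] sample = reader.ReadBytes(SampleSize);
                    if (!writers.TryGetValue(id, out var w))
                    {
                        // First time we see this identifier: open its own file.
                        w = new BinaryWriter(File.Create(
                            Path.Combine(outputDir, $"signal_{id}.bin")));
                        writers[id] = w;
                    }
                    w.Write(sample);
                }
            }
        }
        finally
        {
            foreach (var w in writers.Values) w.Dispose();
        }
    }
}
```

After splitting, a signal's sample count is just its file length divided by the sample size, which is exactly the length information the original file's header lacks.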
Upvotes: 0
Reputation: 171216
I would repeatedly scan the entire input file sequentially. On each pass through the file, I would collect as many "signals" in memory as memory can hold. Once the working buffer is full, I would write the collected signals to an output Matlab file and start the next pass to collect more signals. Once no new signals are found, the algorithm ends.
This algorithm requires multiple passes over the data for big files, but at least the I/O is sequential, which is rather fast.
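A rough sketch of that multi-pass scheme, again assuming a hypothetical fixed `SampleSize` payload; `SignalsPerPass` stands in for "however many signals fit in memory", and `emit` stands in for whatever MAT-file writer you use:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class MultiPassCollector
{
    const int SampleSize = 8;       // hypothetical payload size per sample
    const int SignalsPerPass = 4;   // stand-in for "as many signals as memory holds"

    // Repeatedly scan the input; each pass buffers up to SignalsPerPass new
    // signals fully in memory, then hands each to `emit` (e.g. a MAT-file writer).
    public static void Process(string path, Action<ushort, byte[]> emit)
    {
        var done = new HashSet<ushort>();
        while (true)
        {
            var current = new Dictionary<ushort, MemoryStream>();
            using (var reader = new BinaryReader(File.OpenRead(path)))
            {
                var stream = reader.BaseStream;
                while (stream.Position + 2 + SampleSize <= stream.Length)
                {
                    ushort id = reader.ReadUInt16();
                    byte[] sample = reader.ReadBytes(SampleSize);
                    if (done.Contains(id)) continue;       // already written out
                    if (!current.TryGetValue(id, out var buf))
                    {
                        if (current.Count >= SignalsPerPass) continue; // defer to a later pass
                        current[id] = buf = new MemoryStream();
                    }
                    buf.Write(sample, 0, sample.Length);
                }
            }
            if (current.Count == 0) break;  // no new signals found: done
            foreach (var kv in current)
            {
                emit(kv.Key, kv.Value.ToArray());
                done.Add(kv.Key);
            }
        }
    }
}
```

With 20 signals and a buffer that holds, say, 5 of them, this costs 5 sequential passes (4 collecting, 1 finding nothing new), trading extra reads for bounded memory.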
Upvotes: 1