Reputation: 1009
A WPF .NET 4.5 app that I have been developing, initially to work on small data volumes, now works on much larger data volumes in the region of 1 million rows and more, and of course I started running out of memory. The data comes from an MS SQL DB and needs to be loaded into a local data structure for processing. Because this data is then transformed / processed / referenced by CLR code, continuous and uninterrupted data access is required; however, not all of the data has to be loaded into memory straight away, only when it is actually accessed. As a small example, an Inverse Distance Interpolator uses this data to produce interpolated maps, and all of the data needs to be passed to it for continuous grid generation.
I have re-written some parts of the app so that only x rows are loaded at any given time, implementing a sliding-window approach to data processing, which works. However, doing this for the rest of the app will require some time investment, and I wonder whether there is a more robust and standard way of approaching this design problem (there has to be, I am not the first one)?
tldr; Does C# provide any data structures or techniques for accessing large amounts of data in an uninterrupted manner, so that it behaves like an IEnumerable but the data is not in memory until it is actually accessed or required, or is it completely up to me to manage memory usage? My ideal would be a structure that automatically implements a buffer-like mechanism, loading in more data as that data is accessed and freeing memory from data that has already been accessed and is no longer of interest. Like some DataTable with an internal buffer maybe?
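For illustration only, here is a minimal sketch of the kind of on-demand access I have in mind (the LazySqlReader helper, the MeasurementRow type and the column layout are all invented for the example): rows are streamed from SQL Server one at a time with yield return, so only the row currently being enumerated is held in memory:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class LazySqlReader
{
    // Streams rows one at a time; nothing is materialized into a list,
    // so memory stays flat regardless of how many rows the query returns.
    public static IEnumerable<MeasurementRow> StreamRows(string connectionString, string sql)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Only the current row is held; earlier rows become
                    // eligible for garbage collection as the caller moves on.
                    yield return new MeasurementRow
                    {
                        X = reader.GetDouble(0),
                        Y = reader.GetDouble(1),
                        Value = reader.GetDouble(2)
                    };
                }
            }
        }
    }
}

// Hypothetical row type used by the sketch above.
public class MeasurementRow
{
    public double X { get; set; }
    public double Y { get; set; }
    public double Value { get; set; }
}

The drawback, of course, is that this only gives forward-only streaming, whereas the interpolator may need to revisit data, which is why I am asking whether there is a standard buffered structure for this.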
Upvotes: 4
Views: 2271
Reputation: 2556
@Sergey The producer-consumer model (proposed by Jim Mischel) is probably your safest solution for complete scalability.
However, if you were to increase the room for the elephant (using your visual metaphor, which fits very well), then compression on the fly is a viable option. Decompress when used and discard after use, leaving the core data structure compressed in memory. Obviously it depends on the data - how much it lends itself to compression - but there is a hell of a lot of room in most data structures. If you have ON and OFF flags for some metadata, these can be buried in the unused bits of 16/32-bit numbers, or at least held in bits rather than bytes; use 16-bit integers for lat / longs with a constant scaling factor to convert each back to a real number before use; strings can be compressed using winzip-type libraries - or indexed so that only ONE copy is held and no duplicates exist in memory, etc....
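As a rough illustration of the kind of packing described above (the PackedPoint layout, field names and scale factors are invented for the example), a point that would naively be stored as two doubles plus separate booleans can be squeezed into six bytes of payload:

using System;

// Illustrative sketch only, not a drop-in for any particular data model.
public struct PackedPoint
{
    // Lat in [-90, +90] and Long in [-180, +180] stored as scaled 16-bit integers.
    private const double LatScale = short.MaxValue / 90.0;
    private const double LongScale = short.MaxValue / 180.0;

    private readonly short _lat;
    private readonly short _long;
    private readonly ushort _flags;   // up to 16 ON/OFF flags held in two bytes

    public PackedPoint(double latitude, double longitude, ushort flags)
    {
        _lat = (short)Math.Round(latitude * LatScale);
        _long = (short)Math.Round(longitude * LongScale);
        _flags = flags;
    }

    // "Decompress" on the fly: convert back to doubles only at the point of use.
    public double Latitude { get { return _lat / LatScale; } }
    public double Longitude { get { return _long / LongScale; } }
    public bool GetFlag(int bit) { return (_flags & (1 << bit)) != 0; }
}

At that scale each unit is roughly 0.003 of a degree, so whether the precision trade-off is acceptable is exactly the kind of data-specific judgement this approach requires.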
Decompression (albeit custom made) on the fly can be lightning fast.
This whole process can be very laborious I admit, but can definitely keep the room large enough as the elephant grows - in some instances. (Of course, it may never be good enough if the data is simply growing indefinitely)
EDIT: Re any sources...
Hi @Sergey, I wish I could!! Truly! I have used this technique for data compression and really the whole thing was custom designed on a whiteboard with one or two coders involved.
It's certainly not (all) rocket science, but it's good to fully scope out the nature of all the data; then you know (for example) that a certain figure will never exceed 9999, so you can choose how to store it in the minimum number of bits, and then allocate the leftover bits (assuming 32-bit storage) to other values. (A real-world example is the number of fingers a person has... loosely speaking you could set an upper limit at 8 or 10, although 12 is possible, and even 20 is remotely feasible if they have extra fingers, etc. You can see what I mean.) Lat / Longs are the PERFECT example of numbers that will never cross logical boundaries (unless you use wrap-around values...). That is, they are always between -90 and +90 (just guessing which type of Lat / Longs) - which is very easy to reduce / convert as the range of values is so neat.
So we did not rely 'directly' on any third party literature. Only upon algorithms designed for specific types of data.
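To make that bit-budgeting idea concrete (the BitPacking helper, field names and bit widths are invented for the example), several bounded values can share a single 32-bit word using shifts and masks:

// Illustrative only: 14 bits for a 0-9999 figure, 5 bits for a 0-20 finger count,
// leaving 13 bits free for flags or another small field.
public static class BitPacking
{
    public static uint Pack(int figure, int fingers)
    {
        return (uint)(figure & 0x3FFF) | ((uint)(fingers & 0x1F) << 14);
    }

    public static int UnpackFigure(uint packed) { return (int)(packed & 0x3FFF); }
    public static int UnpackFingers(uint packed) { return (int)((packed >> 14) & 0x1F); }
}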
In other projects, for fast real-time DSP (processing), the smarter coders (experienced game programmers) would convert floats to 16-bit ints and have a global scaling factor calculated to give maximum precision for the particular data stream (accelerometers, LVDTs, pressure gauges, etc.) being collected.
This reduced the transmitted AND stored data without losing ANY information. Similarly, for real-time wave / signal data you could use a (Fast) Fourier Transform to turn your noisy wave into its Amplitude, Phase and Spectrum components - literally half of the data values, without actually losing any (significant) data. (Within these algorithms, the data 'loss' is completely measurable - so you can decide if you are in fact losing data.)
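A small sketch of that float-to-16-bit-int idea (the SampleScaling name and the full-scale-range rule are assumptions for the example); the scale factor is derived once from the expected range of the sensor and then applied to every sample, halving storage from 4 bytes to 2 bytes per value:

using System;
using System.Linq;

public static class SampleScaling
{
    // Derive one global scale factor from the expected full-scale range of the stream,
    // e.g. +/- 16 g for an accelerometer, so the full Int16 range is used.
    public static double ScaleFor(double fullScaleRange)
    {
        return short.MaxValue / fullScaleRange;
    }

    public static short[] Compress(float[] samples, double scale)
    {
        return samples.Select(s => (short)Math.Round(s * scale)).ToArray();
    }

    public static float[] Decompress(short[] packed, double scale)
    {
        return packed.Select(p => (float)(p / scale)).ToArray();
    }
}

Whether rounding to one part in 32,767 of full scale counts as "no information lost" depends on the sensor's own noise floor - which is the measurable-loss point made above.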
Similarly there are algorithms like Rainflow Analysis (nothing to do with rain, more about cycles and frequency) which reduce your data a lot. Peak detection and vector analysis can be enough for some other signals, which basically throws out about 99% of the data... The list is endless, but the technique MUST be intimately suited to your data. And you may have many different types of data, each lending itself to a different 'reduction' technique. I'm sure you can google 'lossless data reduction' (although I think the term lossless was coined by music processing and is a little misleading, since digital music has already lost the upper and lower freq ranges... I digress)... Please post what you find (if of course you have the time / inclination to research this further).
I would be interested to discuss your meta data, perhaps a large chunk can be 'reduced' quite elegantly...
Upvotes: 2
Reputation: 133995
As far as iterating through a data set that is too large to fit in memory goes, you can use a producer-consumer model. I used something like this when I was working with a custom data set that contained billions of records - about 2 terabytes of data in total.
The idea is to have a single class that contains both producer and consumer. When you create a new instance of the class, it spins up a producer thread that fills a constrained concurrent queue. And that thread keeps the queue full. The consumer part is the API that lets you get the next record.
You start with a shared concurrent queue. I like the .NET BlockingCollection for this.
Here's an example that reads a text file and maintains a queue of 10,000 text lines.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class TextFileLineBuffer
{
    private const int QueueSize = 10000;
    private BlockingCollection<string> _buffer = new BlockingCollection<string>(QueueSize);
    private CancellationTokenSource _cancelToken;
    private StreamReader _reader;

    public TextFileLineBuffer(string filename)
    {
        // File is opened here so that any exception is thrown on the calling thread.
        _reader = new StreamReader(filename);
        _cancelToken = new CancellationTokenSource();
        // start task that reads the file
        Task.Factory.StartNew(ProcessFile, TaskCreationOptions.LongRunning);
    }

    public string GetNextLine()
    {
        if (_buffer.IsCompleted)
        {
            // The buffer is empty because the file has been read
            // and all lines returned.
            // You can either call this an error and throw an exception,
            // or you can return null.
            return null;
        }

        // If there is a record in the buffer, it is returned immediately.
        // Otherwise, Take does a non-busy wait.
        // You might want to catch the OperationCanceledException here and return null
        // rather than letting the exception escape.
        return _buffer.Take(_cancelToken.Token);
    }

    private void ProcessFile()
    {
        while (!_reader.EndOfStream && !_cancelToken.Token.IsCancellationRequested)
        {
            var line = _reader.ReadLine();
            try
            {
                // This will block if the buffer already contains QueueSize records.
                // As soon as a space becomes available, this will add the record
                // to the buffer.
                _buffer.Add(line, _cancelToken.Token);
            }
            catch (OperationCanceledException)
            {
                // Cancelled while waiting for space; the loop condition will exit.
            }
        }
        _buffer.CompleteAdding();
    }

    public void Cancel()
    {
        _cancelToken.Cancel();
    }
}
That's the bare bones of it. You'll want to add a Dispose method that will make sure that the thread is terminated and that the file is closed.
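One possible shape for that Dispose (a sketch, not part of the original class; it assumes the constructor stores the Task returned by StartNew in a hypothetical _readerTask field and that the class declares : IDisposable):

public void Dispose()
{
    _cancelToken.Cancel();        // unblocks Add/Take and signals ProcessFile to stop
    try { _readerTask.Wait(); }   // make sure the background task has finished with the file
    catch (AggregateException) { /* cancellation may surface here; safe to ignore */ }
    _reader.Dispose();            // close the file
    _buffer.Dispose();
    _cancelToken.Dispose();
}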
I've used this basic approach to good effect in many different programs. You'll have to do some analysis and testing to determine the optimum buffer size for your application. You want something large enough to keep up with the normal data flow and also handle bursts of activity, but not so large that it exceeds your memory budget.
If you want to support IEnumerable<T>, you have to make some minor modifications. I'll extend my example to support IEnumerable<String>.
First, you have to change the class declaration:
public class TextFileLineBuffer: IEnumerable<string>
Then, you have to implement GetEnumerator:
public IEnumerator<String> GetEnumerator()
{
    foreach (var s in _buffer.GetConsumingEnumerable())
    {
        yield return s;
    }
}

IEnumerator IEnumerable.GetEnumerator()
{
    return GetEnumerator();
}
With that, you can initialize the thing and then pass it to any code that expects an IEnumerable<string>. So it becomes:
var items = new TextFileLineBuffer(filename);
DoSomething(items);

void DoSomething(IEnumerable<string> list)
{
    foreach (var s in list)
        Console.WriteLine(s);
}
Upvotes: 5