Reputation: 943
I want to write an app that can process large amounts of data (say, years of tick price data). The data can come from a file server, the Web, etc., but the idea is that there's too much of it to hold in the computer's memory at one time. As I process the data, I'll write the results out (say, to disk), and then I can discard the data.
I'm working in F# so feedback relating to .NET is most helpful. I don't have to have concrete answers, just pointers to good reading in this problem area would be very much appreciated.
Is there a design pattern or toolkit for this? It seems similar to dataflow programming, in that I only want to work on part of the available data at one time, except that unlike dataflow programming I want to pull the data in rather than wait for it to arrive and then react.
I also want to do parallel processing of this data. The way I'm currently thinking of structuring this is:

a. Each thread requests some data to work with.

b. A data reader pulls in as large a chunk of the requested data as can be cached in the computer's memory. When the thread finishes with this chunk, another chunk can be pulled in and cached.

c. The data reader also knows which chunks are currently cached, so that if multiple threads request the same chunk, they can all read from the same cache (they won't have to write to it). A rough sketch of what I mean follows this list.

Again, is there a .NET data structure or design pattern for this?
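To make (c) concrete, here's a rough sketch of the kind of chunk cache I have in mind (the names `ChunkedReader` and `loadChunk` are just placeholders, and I'm assuming chunks can be identified by an integer index):

```fsharp
open System
open System.Collections.Concurrent

/// Chunks are identified by index and loaded at most once; concurrent
/// readers of the same chunk share the cached array. loadChunk stands in
/// for whatever actually reads the data (file, web, etc.).
type ChunkedReader<'T>(loadChunk: int -> 'T[]) =
    let cache = ConcurrentDictionary<int, Lazy<'T[]>>()

    /// Returns the chunk, loading it on first request. Lazy<_> guarantees
    /// loadChunk runs only once per index, even under concurrent requests.
    member _.GetChunk (index: int) : 'T[] =
        cache.GetOrAdd(index, fun i -> Lazy<'T[]>(fun () -> loadChunk i)).Value

    /// Drop a chunk from the cache once all consumers are done with it.
    member _.Evict (index: int) =
        cache.TryRemove(index) |> ignore
```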
Finally, is all this work just overengineering? For instance, would it be better to just read the entire data stream into an array or hash table and let OS paging worry about the issues I describe above?
I imagine SQL Server deals with issues like this, but the data I want to read might not be in a database, and I'd prefer not to introduce a dependency on SQL Server. I also know that F# has sequences for lazy evaluation of data, but I'm not sure that applies to random access: I might want to start reading from any point in the entire stream, and only from that point will I be accessing it sequentially.
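For example, something like this is roughly what I have in mind (a sketch that assumes fixed-size records, so a record index maps to a byte offset; `recordsFrom` is just a name I made up):

```fsharp
open System.IO

/// Expose a lazy sequence of fixed-size records starting from an arbitrary
/// record index; the stream is only read as the sequence is consumed.
/// For simplicity, a short read is treated as the end of the data.
let recordsFrom (recordSize: int) (startIndex: int64) (path: string) : seq<byte[]> =
    seq {
        use stream = File.OpenRead(path)
        stream.Seek(startIndex * int64 recordSize, SeekOrigin.Begin) |> ignore
        let buffer = Array.zeroCreate recordSize
        let mutable read = stream.Read(buffer, 0, recordSize)
        while read = recordSize do
            yield Array.copy buffer   // copy, because the buffer is reused
            read <- stream.Read(buffer, 0, recordSize)
    }
```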
Upvotes: 0
Views: 241
Reputation: 22719
The main question seems to be answered quite nicely by using the Stream classes in .NET. Streams can be implemented over just about anything (memory, file, network, etc.), so if you write your code to read from one stream and write out to another, you can change either implementation without changing the rest of the code.
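For instance, a minimal sketch (`processStream` and `transform` are placeholder names, and the chunked read/transform/write loop is just one way to structure it):

```fsharp
open System.IO

/// Reads from any input Stream in fixed-size chunks, transforms each chunk,
/// and writes the result to any output Stream. The caller decides what the
/// concrete streams are (file, memory, network, ...).
let processStream (transform: byte[] -> byte[]) (input: Stream) (output: Stream) =
    let buffer : byte[] = Array.zeroCreate 65536
    let mutable read = input.Read(buffer, 0, buffer.Length)
    while read > 0 do
        let chunk = transform buffer.[0 .. read - 1]
        output.Write(chunk, 0, chunk.Length)
        read <- input.Read(buffer, 0, buffer.Length)

// Same processing code, interchangeable sources and sinks:
// use input  = File.OpenRead("ticks.dat")
// use output = File.Create("results.dat")
// processStream id input output
```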
As far as parallel processing is concerned, I assume there is a "record" concept in the large files. If that is the case, and since you're using F#, you should be able to create an iterator over the stream and then use F#'s parallelism features to process each record.
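Something along these lines, as a sketch: `readTick`, `processTick`, and `writeResults` are placeholders for your record format and your actual work, and the stream is assumed to be seekable so its length is known.

```fsharp
open System.IO

/// Lazily iterate records from a stream; nothing is read until the sequence
/// is consumed, so the whole file never has to fit in memory.
let records (readRecord: BinaryReader -> 'T) (stream: Stream) : seq<'T> =
    seq {
        use reader = new BinaryReader(stream)
        while stream.Position < stream.Length do
            yield readRecord reader
    }

// Process one bounded batch at a time in parallel, then discard it:
// File.OpenRead("ticks.dat")
// |> records readTick
// |> Seq.chunkBySize 10000
// |> Seq.iter (fun batch ->
//     batch |> Array.Parallel.map processTick |> writeResults)
```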
Upvotes: 3
Reputation: 6366
I would use a master/slave design pattern, which is kind of where I think you were going with your second point. Do not let the OS page the data; you will see a horrible slowdown, and your application will never finish.
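A rough sketch of the idea using F#'s MailboxProcessor as the master handing out work on request (chunk indices stand in for whatever unit of work you use; `processChunk` is a placeholder):

```fsharp
// The master hands out chunk indices on request until it runs out of work.
type Msg = GetWork of AsyncReplyChannel<int option>

let master totalChunks =
    MailboxProcessor.Start(fun inbox ->
        let rec loop next = async {
            let! (GetWork reply) = inbox.Receive()
            if next < totalChunks then
                reply.Reply(Some next)
                return! loop (next + 1)
            else
                reply.Reply None
                return! loop next
        }
        loop 0)

/// A worker pulls work from the master until none is left.
let worker (m: MailboxProcessor<Msg>) processChunk =
    let rec run () = async {
        match! m.PostAndAsyncReply GetWork with
        | Some chunk ->
            processChunk chunk
            return! run ()
        | None -> return ()
    }
    run ()

// let m = master 100
// [ for _ in 1 .. 4 -> worker m (printfn "processing chunk %d") ]
// |> Async.Parallel |> Async.RunSynchronously |> ignore
```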
Upvotes: 1