Reputation: 7725
Is it somehow possible to chain together several LINQ queries on the same IEnumerable?
Some background:
I have some files, 20-50 GB in size, which will not fit in memory. Some code parses messages from such a file, and basically does this:
public IEnumerable<Record> ReadRecordsFromStream(Stream inStream) {
    Record msg;
    while ((msg = ReadRecord(inStream)) != null) {
        yield return msg;
    }
}
This allows me to perform interesting queries on the records, e.g. find the average duration of a Record:
var records = ReadRecordsFromStream(stream);
var avg = records.Average(x => x.Duration);
Or perhaps the number of records per hour/minute:
var x = from t in records
        group t by t.Time.Hour + ":" + t.Time.Minute into g
        select new { Period = g.Key, Frequency = g.Count() };
And there are a dozen or so more queries I'd like to run to pull relevant info out of these records. Some of the simple queries can certainly be combined into a single query, but this seems to get unmanageable quite fast.
Now, each time I run one of these queries, I have to read the file from the beginning again, with all records reparsed. Parsing a 20 GB file 20 times takes time, and is a waste.
What can I do to make just one pass over the file, but run several LINQ queries against it?
Upvotes: 2
Views: 277
Reputation: 950
I have done this before for logs of 3-10 MB per file. I haven't reached that file size, but I have run this against more than 1 GB of log files in total without consuming much RAM. You may try what I did.
Upvotes: 0
Reputation: 1500185
You might want to consider using Reactive Extensions for this. It's been a while since I've used it, but you'd probably create a Subject<Record>, attach all your queries to it (as appropriate IObservable<T> variables) and then hook up the data source. That will push all the data through the various aggregations for you, only reading from disk once.
While the exact details elude me without downloading the latest build myself, I blogged on this a couple of times: part 1; part 2. (Various features that I complained about being missing in part 1 were added :)
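As a rough illustration of the Subject<Record> idea, here is a minimal sketch. It assumes the System.Reactive NuGet package is referenced, and it uses a stand-in Record class with a double Duration and a DateTime Time, since the real type is not shown in the question. The SinglePassDemo class and its Run method are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// Stand-in for the question's Record type (assumed shape, not the real one).
public class Record
{
    public DateTime Time { get; set; }
    public double Duration { get; set; }
}

public static class SinglePassDemo
{
    // Pushes every record through two queries while enumerating the source only once.
    public static (double Average, Dictionary<string, int> PerMinute) Run(IEnumerable<Record> records)
    {
        var subject = new Subject<Record>();
        double average = 0;
        var perMinute = new Dictionary<string, int>();

        // Query 1: average duration. The result is emitted when the subject completes.
        // Note: with no records at all, Average signals an error instead of a value.
        subject.Select(r => r.Duration)
               .Average()
               .Subscribe(a => average = a);

        // Query 2: number of records per hour:minute, mirroring the question's grouping key.
        subject.GroupBy(r => r.Time.Hour + ":" + r.Time.Minute)
               .SelectMany(g => g.Count().Select(c => new { Period = g.Key, Frequency = c }))
               .Subscribe(x => perMinute[x.Period] = x.Frequency);

        // Hook up the data source last: this loop is the single pass over the records.
        foreach (var record in records)
        {
            subject.OnNext(record);
        }
        subject.OnCompleted(); // lets the aggregates publish their final values

        return (average, perMinute);
    }
}
```

In the question's setting, the argument to Run would be the lazy sequence returned by ReadRecordsFromStream(stream), so the file is still streamed rather than loaded into memory. Further queries can be attached to the same subject before the loop starts.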
Upvotes: 5
Reputation: 317
There's a technology that allows you to do this kind of thing. It's called a database :)
Upvotes: -1