Reputation: 17038
I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
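To make the question concrete, here's roughly the producer shape I'm picturing, following the quoted advice (every name below is hypothetical, not working code):

```fsharp
open System.Collections.Concurrent
open System.IO

// Roughly the shape I'm picturing (all names here are hypothetical):
// a bounded queue of raw record text, so memory use stays flat. A
// cheap producer pass only splits out records; parallel consumers
// would run the real FParsec parser on each queued string.
let enqueueRecords (path: string) (queue: BlockingCollection<string>) =
    for line in File.ReadLines path do   // one record per line, read lazily
        queue.Add line                   // blocks when the queue is full
    queue.CompleteAdding()               // signal consumers that we're done

// A bounded capacity of 10,000 raw records caps the memory footprint.
let queue = new BlockingCollection<string>(10000)
```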
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
Upvotes: 4
Views: 974
Reputation: 3838
The "obvious" thing that comes to mind, would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (from your PDF it looks like a record spans a few lines), then you can build a seq of records, or batches of 1000 records or so, using normal FileStream reading. This wouldn't need to know the details of the records, but it would help if you can at least delimit them.
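For example, something like this could chop the file into raw record strings without parsing them; the blank-line delimiter is purely an assumption for illustration:

```fsharp
open System.IO

// Accumulate lines into raw record strings, assuming (purely for
// illustration) that records are separated by blank lines. Only
// the delimiter matters here, not the record's internal structure.
let rawRecords (path: string) : seq<string> =
    seq {
        use reader = new StreamReader(path)
        let buffer = System.Text.StringBuilder()
        while not reader.EndOfStream do
            let line = reader.ReadLine()
            if line = "" then
                if buffer.Length > 0 then
                    yield buffer.ToString()
                    buffer.Clear() |> ignore
            else
                buffer.AppendLine line |> ignore
        if buffer.Length > 0 then yield buffer.ToString()
    }

// Lazy batches of 1000 raw records each.
let recordBatches (path: string) : seq<string[]> =
    rawRecords path |> Seq.chunkBySize 1000
```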
Either way, you end up with a lazy seq that the parser can then read.
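And if you want the parallelism from the FParsec docs you quoted, the batches can be drained with parallel workers; this sketch reuses recordBatches and the stand-in pRecord from above:

```fsharp
open FParsec

// Parse each lazy batch in parallel, yielding a lazy seq<'Record>
// overall. Batches are pulled one at a time, so memory stays bounded.
let parseFile (pRecord: Parser<'Record, unit>) (path: string) : seq<'Record> =
    recordBatches path
    |> Seq.collect (fun batch ->
        batch
        |> Array.Parallel.map (fun text ->
            match run pRecord text with
            | Success (record, _, _) -> record
            | Failure (msg, _, _)    -> failwithf "Parse error: %s" msg))
```

Array.Parallel.map preserves element order, so the resulting seq comes back in file order even though each batch is parsed on multiple threads.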
Upvotes: 5