BowserKingKoopa

Reputation: 2741

FileHelpers throws OutOfMemoryException when parsing large csv file

I'm trying to parse a very large csv file with FileHelpers (http://www.filehelpers.net/). The file is 1GB zipped and about 20GB unzipped.

        string fileName = @"c:\myfile.csv.gz";
        using (var fileStream = File.OpenRead(fileName))
        {
            using (GZipStream gzipStream = new GZipStream(fileStream, CompressionMode.Decompress, false))
            {
                using (TextReader textReader = new StreamReader(gzipStream))
                {
                    var engine = new FileHelperEngine<CSVItem>();
                    CSVItem[] items = engine.ReadStream(textReader);                        
                }
            }
        }

FileHelpers then throws an OutOfMemoryException.

    Test failed: Exception of type 'System.OutOfMemoryException' was thrown.
    System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
       at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
       at System.Text.StringBuilder.Append(Char value, Int32 repeatCount)
       at System.Text.StringBuilder.Append(Char value)
       at FileHelpers.StringHelper.ExtractQuotedString(LineInfo line, Char quoteChar, Boolean allowMultiline)
       at FileHelpers.DelimitedField.ExtractFieldString(LineInfo line)
       at FileHelpers.FieldBase.ExtractValue(LineInfo line)
       at FileHelpers.RecordInfo.StringToRecord(LineInfo line)
       at FileHelpers.FileHelperEngine`1.ReadStream(TextReader reader, Int32 maxRecords, DataTable dt)
       at FileHelpers.FileHelperEngine`1.ReadStream(TextReader reader)

Is it possible to parse a file this big with FileHelpers? If not can anyone recommend an approach to parsing files this big? Thanks.

Upvotes: 7

Views: 5780

Answers (2)

Marcos Meli

Reputation: 3506

You must work record by record, like this:

  string fileName = @"c:\myfile.csv.gz";
  using (var fileStream = File.OpenRead(fileName))
  {
      using (GZipStream gzipStream = new GZipStream(fileStream, CompressionMode.Decompress, false))
      {
          using (TextReader textReader = new StreamReader(gzipStream))
          {
            var engine = new FileHelperAsyncEngine<CSVItem>();
            using(engine.BeginReadStream(textReader))
            {
                foreach(var record in engine)
                {
                   // Work with each item
                }
            }
          }
      }
  }

If you use this async approach you will only be using the memory for one record at a time, and it will be much faster.
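For completeness, here is a minimal sketch of what the CSVItem record class could look like, assuming a comma-delimited file with optionally quoted fields (the field names and types are hypothetical; adjust them to your actual columns):

    using System;
    using FileHelpers;

    // Hypothetical layout: replace the delimiter, names and types with your real columns.
    [DelimitedRecord(",")]
    public class CSVItem
    {
        [FieldQuoted('"', QuoteMode.OptionalForBoth)]
        public string Name;

        public int Quantity;

        [FieldConverter(ConverterKind.Date, "yyyy-MM-dd")]
        public DateTime Date;
    }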

Upvotes: 13

sammy_winter

Reputation: 139

This isn't a complete answer, but with a 20GB CSV file you would need 20GB+ of memory to hold the whole thing at once, unless your reader kept everything compressed in memory (which is unlikely). You need to read the file in chunks; loading everything into a single array won't work unless you have a huge amount of RAM.

You need a loop a bit more like this:

// Pseudocode: CsvReader and ReadNextItem stand in for whichever streaming reader you use.
CsvReader reader = new CsvReader(filePath);
CSVItem item = reader.ReadNextItem();
while (item != null)
{
    DoWhatINeedWithCsvRow(item);
    item = reader.ReadNextItem();
}

.NET's garbage collector will then be able to reclaim the old CSVItems as you go through them, provided you don't keep references to them hanging around.

A better version would read a chunk from the CSV (e.g. 10,000 rows), process all of those, then fetch another chunk, or create a task for DoWhatINeedWithCsvRow if you don't care about processing order; a rough sketch of the chunked variant follows.
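Here is a sketch of that chunked approach built on the FileHelperAsyncEngine from the answer above; the batch size of 10,000 and the ProcessBatch method are illustrative placeholders, not part of FileHelpers:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using FileHelpers;

    class ChunkedCsvProcessing
    {
        const int BatchSize = 10000;   // illustrative chunk size

        static void Main()
        {
            string fileName = @"c:\myfile.csv.gz";
            using (var fileStream = File.OpenRead(fileName))
            using (var gzipStream = new GZipStream(fileStream, CompressionMode.Decompress))
            using (TextReader textReader = new StreamReader(gzipStream))
            {
                var engine = new FileHelperAsyncEngine<CSVItem>();
                var batch = new List<CSVItem>(BatchSize);

                using (engine.BeginReadStream(textReader))
                {
                    foreach (var record in engine)
                    {
                        batch.Add(record);
                        if (batch.Count == BatchSize)
                        {
                            ProcessBatch(batch);   // work on one chunk at a time
                            batch.Clear();
                        }
                    }
                }

                if (batch.Count > 0)
                    ProcessBatch(batch);           // flush the final partial chunk
            }
        }

        // Hypothetical worker; replace with your own logic (bulk insert, aggregation, a Task, ...).
        static void ProcessBatch(List<CSVItem> rows)
        {
            Console.WriteLine("Processed {0} rows", rows.Count);
        }
    }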

Upvotes: 0
