user4574

Reputation: 374

Reading a CSV file with 50M lines, how to improve performance

I have a data file in CSV (Comma-Separated-Value) format that has about 50 million lines in it.

Each line is read into a string, parsed, and then used to fill in the fields of an object of type FOO. The object then gets added to a List(of FOO) that ultimately has 50 million items.

That all works, and fits in memory (at least on an x64 machine), but it's SLOW. It takes about 5 minutes every time I load and parse the file into the list. I would like to make it faster. How can I make it faster?

The important parts of the code are shown below.

Public Sub LoadCsvFile(ByVal FilePath As String)
    Dim s As IO.StreamReader = My.Computer.FileSystem.OpenTextFileReader(FilePath)

    'Find header line
    Dim L As String
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Exit While
    End While
    'Parse data lines
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line          
        Dim T As FOO = FOO.FromCSV(L)
        Add(T)
    End While
    s.Close()
End Sub


Public Class FOO
    Public time As Date
    Public ID As UInt64
    Public A As Double
    Public B As Double
    Public C As Double

    Public Shared Function FromCSV(ByVal X As String) As FOO
        Dim T As New FOO
        Dim tokens As String() = X.Split(","c) '","c is a Char literal, so this compiles with Option Strict On
        If Not DateTime.TryParse(tokens(0), T.time) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid ISO 8601 timestamp")
        End If
        If Not UInt64.TryParse(tokens(1), T.ID) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid ID")
        End If
        If Not Double.TryParse(tokens(2), T.A) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for A")
        End If
        If Not Double.TryParse(tokens(3), T.B) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for B")
        End If
        If Not Double.TryParse(tokens(4), T.C) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for C")
        End If
        Return T
    End Function
End Class

I did some benchmarking and here are the results.

At this point the only path I see is to just display status to the user every few seconds so they don't wonder if the program is frozen or something.

UPDATE
I created two new functions. One saves the dataset from memory into a binary file using System.IO.BinaryWriter. The other loads that binary file back into memory using System.IO.BinaryReader. The binary versions were considerably faster than the CSV versions, and the binary files take up much less space.

Here are the benchmark results (same dataset for all tests):

Upvotes: 3

Views: 2712

Answers (1)

MarkPflug

Reputation: 29568

I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV, which libraries attempt to handle, is dealing with fields that contain embedded commas or newlines, which require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.
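
For illustration, here is the kind of line that trips up a naive Split (hypothetical data, shown in C#; your file never hits this case because none of your columns are strings):

using System;

// Hypothetical illustration: quoting is what makes general CSV parsing hard.
// This line has 3 logical fields, but a naive Split on ',' yields 4 tokens,
// because the quoted field itself contains a comma.
var line = "2024-01-01T00:00:00,\"Smith, John\",42";
var tokens = line.Split(',');
Console.WriteLine(tokens.Length); // prints 4, not 3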

As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse. My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting them to strings first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
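
As a rough sketch of the TryParseExact suggestion (assuming every timestamp in the file uses one exact format; the format string below is the same assumption the library example makes, so adjust it to match your real data):

using System;
using System.Globalization;

// Sketch only: parse one token with an exact format instead of letting TryParse
// guess the format on every row.
var token = "2021-01-01T13:45:30";
if (!DateTime.TryParseExact(token, "yyyy-MM-ddTHH:mm:ss",
        CultureInfo.InvariantCulture, DateTimeStyles.None, out var time))
{
    throw new FormatException("Invalid ISO 8601 timestamp: " + token);
}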

Here is an example of using my library, Sylvan.Data.Csv, to process the CSV in C#.

static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;

    var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    var csvReader = CsvDataReader.Create(textReader, csvOpts);

    // estimate the number of rows to avoid repeatedly growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);

    while (csvReader.Read())
    {
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}

I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core. Running on .NET Framework, the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or even minutes, to read the whole file.

Given that, my advice would be to abandon CSV altogether, which means you can abandon the parsing that is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property in that they are fixed width: each record contains 5 fields that are 8 bytes (64-bit) wide, so each record requires exactly 40 bytes in binary form. 50M x 40 bytes = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value is essentially copied directly into memory.

Here are some examples of how to do this in C# (I don't know VB very well, sorry).


static List<Foo> Read(string file)
{
    using var stream = File.OpenRead(file);
    using var br = new BinaryReader(stream);
    // the exact number of records can be determined by looking at the length of the file.
    var recordCount = (int)(stream.Length / 40);
    var data = new List<Foo>(recordCount);
    for (int i = 0; i < recordCount; i++)
    {
        var ticks = br.ReadInt64();
        var id = br.ReadInt64();
        var a = br.ReadDouble();
        var b = br.ReadDouble();
        var c = br.ReadDouble();
        var f = new Foo()
        {
            time = new DateTime(ticks),
            ID = id,
            A = a,
            B = b,
            C = c,
        };
        data.Add(f);
    }
    return data;
}

static void Write(List<Foo> data, string file)
{
    using var stream = File.Create(file);
    using var bw = new BinaryWriter(stream);
    foreach (var item in data)
    {
        bw.Write(item.time.Ticks);
        bw.Write(item.ID);
        bw.Write(item.A);
        bw.Write(item.B);
        bw.Write(item.C);
    }
}
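
A possible way to tie the two pieces together (ReadCsv is a stand-in name for whatever currently loads the CSV, and the file names are placeholders):

// One-time (or occasional) conversion from CSV, then fast binary loads afterwards.
var data = ReadCsv("data.csv");    // slow: parse the CSV as you do today
Write(data, "data.bin");           // persist it in the 40-byte binary format
var reloaded = Read("data.bin");   // fast: subsequent runs load from binary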

This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason that you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change every time, or will it only be appended to with new data? If it is only appended to, I would investigate a solution where, each time the app starts, you convert only the newly appended section of CSV data and add it to a binary file, then load everything from that binary file. That way you only pay the cost of processing the new CSV data each time, and everything loads quickly from the binary form.
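
Here is a rough sketch of that incremental approach (the byte-offset bookmark and all of the names are illustrative assumptions, not a tested implementation):

using System.Collections.Generic;
using System.IO;

// Sketch: remember how many bytes of the CSV have already been converted, parse only
// the newly appended tail on each run, append it to the binary file, then load
// everything from binary.
static List<Foo> LoadIncremental(string csvPath, string binPath, string offsetPath)
{
    long alreadyConverted = File.Exists(offsetPath)
        ? long.Parse(File.ReadAllText(offsetPath))
        : 0;

    using (var csv = File.OpenRead(csvPath))
    using (var reader = new StreamReader(csv))
    using (var bin = new FileStream(binPath, FileMode.Append, FileAccess.Write))
    using (var bw = new BinaryWriter(bin))
    {
        csv.Seek(alreadyConverted, SeekOrigin.Begin);  // skip rows converted on earlier runs
        if (alreadyConverted == 0) reader.ReadLine();  // skip the header on the first run

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;
            var item = Foo.FromCsv(line);              // hypothetical C# port of FOO.FromCSV
            bw.Write(item.time.Ticks);
            bw.Write(item.ID);
            bw.Write(item.A);
            bw.Write(item.B);
            bw.Write(item.C);
        }
    }

    File.WriteAllText(offsetPath, new FileInfo(csvPath).Length.ToString());
    return Read(binPath);                              // the binary Read method shown above
}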

This could be made even faster by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This can be done because all of your data elements are "blittable". This would be the absolute fastest way to load this data into your program. Start with the BinaryReader/BinaryWriter approach, and if you find that still isn't fast enough, investigate this.
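
For reference, a minimal sketch of what that could look like (the struct name and layout are assumptions, and it relies on the file using the same 40-byte layout the BinaryWriter code above produces, read on a little-endian machine):

using System;
using System.IO;
using System.Runtime.InteropServices;

// A fixed-layout, blittable record matching the 40-byte binary format. The timestamp
// is stored as raw ticks so the struct contains only primitive fields.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord
{
    public long TimeTicks;
    public ulong ID;
    public double A;
    public double B;
    public double C;

    public DateTime Time => new DateTime(TimeTicks);
}

static FooRecord[] ReadRecords(string file)
{
    using var stream = File.OpenRead(file);
    var records = new FooRecord[stream.Length / 40];

    // Reinterpret the whole array as raw bytes and fill it straight from the stream:
    // no BinaryReader, no per-field parsing, no per-record object allocation.
    Span<byte> bytes = MemoryMarshal.AsBytes(records.AsSpan());
    while (bytes.Length > 0)
    {
        int read = stream.Read(bytes);
        if (read == 0) throw new EndOfStreamException("File shorter than expected.");
        bytes = bytes.Slice(read);
    }
    return records;
}

Consumers then index records[i] and read the fields directly, rather than walking a List(Of FOO) of class objects.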

If you find this solution to work, I'd love to hear the results.

Upvotes: 5
