Reputation: 439
I have one log file for each day of the month. These files are plain text with some info in each line like the snippet below:
1?2017-06-01T00:00:00^148^3
2?myVar1^3454.33
2?myVar2^35
2?myVar3^0
1?2017-06-01T00:00:03^148^3
...
To process and show this data, I'm developing a WPF application that reads these txt files, parses the lines, and saves the data in a SQLite database. Then I allow the user to perform some basic math operations, such as AVG over a subset.
As these files are very large (over 300 MB and 4 million lines each), I'm struggling with memory usage in the ProcessLine
method (as far as I know, the reading part is fine for now). The method never finishes, and the application enters break mode by itself.
My code:
private bool ParseContent(string filePath)
{
    if (string.IsNullOrEmpty(filePath) || !File.Exists(filePath))
        return false;

    string logEntryDateTimeTemp = string.Empty;

    string[] AllLines = new string[5000000]; // only allocate memory here
    AllLines = File.ReadAllLines(filePath);

    Parallel.For(0, AllLines.Length, x =>
    {
        ProcessLine(AllLines[x], ref logEntryDateTimeTemp);
    });

    return true;
}
void ProcessLine(string line, ref string logEntryDateTimeTemp)
{
    if (string.IsNullOrEmpty(line))
        return;

    var logFields = line.Split(_delimiterChars);
    switch (logFields[0])
    {
        case "1":
            logEntryDateTimeTemp = logFields[1];
            break;
        case "2":
            LogEntries.Add(new LogEntry
            {
                Id = ItemsCount + 1,
                CurrentDateTime = logEntryDateTimeTemp,
                TagAddress = logFields[1],
                TagValue = Convert.ToDecimal(logFields[2])
            });
            ItemsCount++;
            break;
        default:
            break;
    }
}
Is there a better way of doing it?
Note: I've also tested two other methods for reading the file, which are:
#region StreamReader
//using (StreamReader sr = File.OpenText(filePath))
//{
//    string line = String.Empty;
//    while ((line = sr.ReadLine()) != null)
//    {
//        if (string.IsNullOrEmpty(line))
//            break;
//        var logFields = line.Split(_delimiterChars);
//        switch (logFields[0])
//        {
//            case "1":
//                logEntryDateTimeTemp = logFields[1];
//                break;
//            case "2":
//                LogEntries.Add(new LogEntry
//                {
//                    Id = ItemsCount + 1,
//                    CurrentDateTime = logEntryDateTimeTemp,
//                    TagAddress = logFields[1],
//                    TagValue = Convert.ToDecimal(logFields[2])
//                });
//                ItemsCount++;
//                break;
//            default:
//                break;
//        }
//    }
//}
#endregion
#region ReadLines
//var lines = File.ReadLines(filePath, Encoding.UTF8);
//foreach (var line in lines)
//{
//    if (string.IsNullOrEmpty(line))
//        break;
//    var logFields = line.Split(_delimiterChars);
//    switch (logFields[0])
//    {
//        case "1":
//            logEntryDateTimeTemp = logFields[1];
//            break;
//        case "2":
//            LogEntries.Add(new LogEntry
//            {
//                Id = ItemsCount + 1,
//                CurrentDateTime = logEntryDateTimeTemp,
//                TagAddress = logFields[1],
//                TagValue = Convert.ToDecimal(logFields[2])
//            });
//            ItemsCount++;
//            break;
//        default:
//            break;
//    }
//}
#endregion
Note 2: I'm using Visual Studio 2017, and when the application is running in debug mode, it suddenly enters break mode, and the message in the Output window reads as follows:
The CLR has been unable to transition from COM context 0xb545a8 to COM context 0xb544f0 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.
Upvotes: 3
Views: 11164
Reputation: 2286
C# has features that allow you to process large files smoothly, without risking an out-of-memory exception.
A best practice is to process each line and immediately pass the result on to the output stream, another file, or even a database, without saturating memory.
First, iterate over the lines in the file using a StreamReader.
Then yield return
the result to the output source (that is, write the results to a file, database, or the output stream). This frees the memory for each line as soon as it has been consumed.
IEnumerable<string> ProcessLines(string filePath)
{
    using (var sr = new System.IO.StreamReader(filePath))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
            yield return line; // process the line of text here, then yield the result
    }
}
To understand this better, read about yield return here: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/statements/yield
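For illustration, this is how such an iterator could be consumed (ProcessLines matches the sketch above); because the sequence is lazy, only one line is materialized at a time:
// Lazy enumeration: the file is never fully loaded into memory.
foreach (var processed in ProcessLines(filePath))
{
    // write the processed line to the database, another file, etc.
}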
Upvotes: 0
Reputation: 460208
You probably get the exception at LogEntries.Add
in ProcessLine
, because you have so many log entries that this collection gets too large for memory.
So you should store the entries in the database immediately instead of adding them to the list.
You should also read only one line, process it, then read the next line and forget the previous one. File.ReadAllLines
reads all lines at once into a string[]
, which will occupy the memory (or cause an OutOfMemoryException
).
You could use a StreamReader
or File.ReadLines
instead.
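As a rough sketch of that combination (StoreEntry is a hypothetical helper that writes one parsed row to SQLite right away; _delimiterChars is taken from the question):
private void ParseContentStreaming(string filePath)
{
    string logEntryDateTimeTemp = string.Empty;

    // File.ReadLines enumerates lazily: only the current line is in memory.
    foreach (var line in System.IO.File.ReadLines(filePath))
    {
        if (string.IsNullOrEmpty(line))
            continue;

        var logFields = line.Split(_delimiterChars);
        switch (logFields[0])
        {
            case "1":
                logEntryDateTimeTemp = logFields[1]; // remember the current timestamp
                break;
            case "2":
                // Hypothetical helper: inserts one row into SQLite immediately
                StoreEntry(logEntryDateTimeTemp, logFields[1],
                           Convert.ToDecimal(logFields[2]));
                break;
        }
    }
}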
Upvotes: 2
Reputation: 2010
You should use a StreamReader and read line by line. That will reduce memory usage for reading.
You should also keep a relatively small buffer of parsed records to be added to the database; about 1000 records works well. Once the collection reaches 1000 items, write it to the database (ideally in a single transaction with a bulk insert), clear the collection, and move on to the next chunk of the input file, as sketched below.
A good approach would also be to remember the processed position in the input file, to make sure the application can resume from the last point in case of failure.
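A minimal sketch of that batching pattern, assuming the Microsoft.Data.Sqlite package and a hypothetical LogEntries(Timestamp, TagAddress, TagValue) table; the per-batch transaction is what makes bulk SQLite inserts fast:
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

void FlushBatch(SqliteConnection connection, List<LogEntry> batch)
{
    using (var transaction = connection.BeginTransaction())
    using (var cmd = connection.CreateCommand())
    {
        cmd.Transaction = transaction;
        cmd.CommandText =
            "INSERT INTO LogEntries (Timestamp, TagAddress, TagValue) " +
            "VALUES ($ts, $addr, $val)";
        var ts   = cmd.Parameters.Add("$ts",   SqliteType.Text);
        var addr = cmd.Parameters.Add("$addr", SqliteType.Text);
        var val  = cmd.Parameters.Add("$val",  SqliteType.Real);

        foreach (var entry in batch)
        {
            ts.Value   = entry.CurrentDateTime;
            addr.Value = entry.TagAddress;
            val.Value  = (double)entry.TagValue; // decimal -> REAL column
            cmd.ExecuteNonQuery();               // one prepared insert per record
        }
        transaction.Commit(); // single transaction for the whole batch
    }
    batch.Clear(); // free the buffer before reading the next chunk
}
In the read loop, add each parsed entry to the buffer and call FlushBatch whenever it reaches 1000 items, plus once more after the loop for any remainder.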
Upvotes: 1
Reputation: 169320
Try to use a StreamReader
instead of loading the entire file into memory at once:
using (System.IO.StreamReader sr = new System.IO.StreamReader(filePath))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        //..
    }
}
Upvotes: 4