tmesser

Reputation: 7666

Read in a file using a regular expression?

This is tangentially related to an earlier question of mine.

Essentially, the solution in that question worked great, but now I need to adapt it to work in a much larger analysis application. Simply using StreamReader.ReadToEnd() is not acceptable, since some of the files I will be reading in are very, very large. If there's been a mistake and someone forgot to clean up, they can theoretically be gigabytes in size. Obviously I can't just read to the end of that.

Unfortunately, reading line by line with ReadLine() is also not acceptable, because some of the rows of data I am reading in contain stack traces, which obviously use \r\n in their formatting. Ideally, I would like to tell the program to read forward until it hits a match for a regex, and then return what it has read. Is there any functionality to do this in .NET? If not, can I get some suggestions for how I'd go about writing it?

Edit: To make it a bit easier to follow my question, here's a paste of some of the important parts of the adapted code:

foreach (var fileString in logpath.Select(log => new StreamReader(log)).Select(fileStream => fileStream.ReadToEnd()))
{
    const string junkPattern = @"\[(?<junk>[0-9]*)\] \((?<userid>.{0,32})\)";
    const string severityPattern = @"INFO|ERROR|FATAL";
    const string datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
    var records = Regex.Split(fileString, datePattern, RegexOptions.Multiline);
    foreach (var record in records.Where(x => string.IsNullOrEmpty(x) == false))
    ......

The problem lies in the foreach. .Select(fileStream => fileStream.ReadToEnd()) reads each whole file into a single string, so it's gonna blow up memory badly on large files, I just know it.

Upvotes: 5

Views: 2413

Answers (1)

VMAtm

Reputation: 28366

First of all, you should move your const definitions to the class level. The compiler treats them as compile-time constants either way, but declaring them on the class makes the code easier to read.

As @Blam mentioned, you should use StringBuilder and StreamReader.ReadLine together, something like this:

// requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
foreach (var filePath in logpath)
{
    var sbRecord = new StringBuilder();
    using (var reader = new StreamReader(filePath))
    {
        string line;
        // ReadLine returns null at end of stream, which also covers empty files
        while ((line = reader.ReadLine()) != null)
        {
            // a line matching datePattern starts a new record
            if (Regex.IsMatch(line, datePattern) && sbRecord.Length > 0)
            {
                // your method for handling a complete log record
                HandleRecord(sbRecord.ToString());
                sbRecord.Clear();
            }
            // in every case the line belongs to the current record
            sbRecord.AppendLine(line);
        }
    }
    // flush the last record of the file, which no later date line terminates
    if (sbRecord.Length > 0)
    {
        HandleRecord(sbRecord.ToString());
    }
}
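
The loop above can also be packaged as an iterator, so that callers pull one record at a time and only the current record is held in memory. This is a minimal sketch under stated assumptions, not a definitive implementation: `LogRecordReader` and `ReadRecords` are hypothetical names, and the pattern is a simplified single-line form of the question's datePattern (no lookahead, since each line is tested on its own).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class LogRecordReader
{
    // Assumption: a record starts on a line beginning with a timestamp
    // such as "2013-01-31 12:00:00,123".
    private static readonly Regex RecordStart = new Regex(
        @"^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}",
        RegexOptions.Compiled);

    // Lazily yields one multi-line record at a time from the reader.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var record = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A timestamp line closes the record collected so far.
            if (RecordStart.IsMatch(line) && record.Length > 0)
            {
                yield return record.ToString();
                record.Clear();
            }
            record.AppendLine(line);
        }
        // Emit the final record, which has no following timestamp line.
        if (record.Length > 0)
        {
            yield return record.ToString();
        }
    }
}
```

A caller would open each file with `new StreamReader(path)` inside a `using` block and `foreach` over `LogRecordReader.ReadRecords(reader)`; passing a `StringReader` instead makes the method easy to exercise without touching the disk.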

If I have misunderstood something about your problem, please clarify it in a comment.
You could also use the ThreadPool to schedule the handling of each record, which may speed up your application.
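
The ThreadPool idea could look roughly like the sketch below. It is only an illustration under assumptions: `RecordDispatcher` and `ProcessAll` are hypothetical names, the handler is assumed to be thread-safe, and the order in which records are handled is not preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class RecordDispatcher
{
    // Queues handler(record) on the thread pool for every record,
    // then blocks until all queued work items have finished.
    public static void ProcessAll(IEnumerable<string> records, Action<string> handler)
    {
        // Start at 1 so the count cannot reach zero while records are still being queued.
        using (var pending = new CountdownEvent(1))
        {
            foreach (var record in records)
            {
                pending.AddCount();
                var current = record; // local copy captured by the lambda
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try { handler(current); }
                    finally { pending.Signal(); }
                });
            }
            pending.Signal(); // release the initial count held by this thread
            pending.Wait();
        }
    }
}
```

Two caveats on this design: an exception thrown by the handler on a pool thread is unhandled and will normally terminate the process, so the handler should catch its own errors; and if records are read faster than they are handled, the queued records accumulate in memory, which works against the original goal for very large files.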

Upvotes: 1
