inutan

Reputation: 10888

Reading from text file with Regex

Below is the method I have written for reading from a text file. While reading, I need to match each line against a given regex and, if it matches, add the line to a collection.

private static void GetOrigionalRGBColours(string txtFile)
{
    string tempLineValue;
    Regex regex = new Regex(@"^\d+.?\d* \d+.?\d* \d+.?\d* SRGB$");

    using (StreamReader inputReader = new StreamReader(txtFile))
    {
        while (null != (tempLineValue = inputReader.ReadLine()))                     
        {
            if (regex.Match(tempLineValue).Success
                && tempLineValue != "1 1 1 SRGB"
                && tempLineValue != "0 0 0 SRGB")
            {
                string[] rgbArray = tempLineValue.Split(' ');
                RGBColour rgbColour = new RGBColour() { Red = Convert.ToDecimal(rgbArray[0]), Green = Convert.ToDecimal(rgbArray[1]), Blue = Convert.ToDecimal(rgbArray[2]) };
                originalColourList.Add(rgbColour);
            }
        }
    }
} 

When this method is run on a 4 MB text file with 28,653 lines, it takes around 3 minutes to finish. After the run, originalColourList contains 582 items.

Can anyone please advise how I can improve the performance of this method? The actual text file size may go up to 60 MB.

FYI-
Right Match for Regex: 0.922 0.833 0.855 SRGB
Wrong Match for Regex: /SRGB /setrgbcolor load def
The txt file was originally a PostScript file; I saved it as a txt file so that I could manipulate it using C#.

Upvotes: 1

Views: 5221

Answers (2)

Kevin

Reputation: 2631

Depending on your record format, you can parse the lines faster without a regex. I don't know everything about your file, but going by your two examples, the code below is about 30% faster than the optimized regex.

        decimal r;
        decimal g;
        decimal b;
        string rec;
        string[] fields;
        List<RGBColour> originalColourList = new List<RGBColour>();

        using (StreamReader sr = new StreamReader(@"c:\temp\rgb.txt"))
        {
            while (null != (rec = sr.ReadLine()))
            {
                // Cheap ordinal suffix check rejects most lines before any splitting or parsing.
                if (rec.EndsWith("SRGB", StringComparison.Ordinal))
                {
                    fields = rec.Split(' ');

                    if (fields.Length == 4 
                        && decimal.TryParse(fields[0], out r) 
                        && decimal.TryParse(fields[1], out g) 
                        && decimal.TryParse(fields[2], out b)
                        && !(r == 0 && g == 0 && b == 0)   // skip black, i.e. "0 0 0 SRGB"
                        && !(r == 1 && g == 1 && b == 1)   // skip white, i.e. "1 1 1 SRGB"
                        )
                    {
                        RGBColour rgbColour = new RGBColour() { Red = r, Green = g, Blue = b };
                        originalColourList.Add(rgbColour);
                    }
                }
            }
        }

The if short-circuits as soon as any of the criteria is false, and when everything is true the values have already been parsed to decimal by TryParse, so no separate conversion step is needed. I parsed 6 million lines with this in approximately 12.5 seconds.
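One thing the loop above leaves open is culture: decimal.TryParse without a format provider uses the current culture, so on a machine whose decimal separator is a comma, a field like 0.922 may fail to parse or parse differently. The following is a minimal sketch of a culture-independent variant, assuming the file always uses a dot as the decimal separator. RgbLineParser and TryParseRgb are hypothetical names introduced only for illustration; they are not part of the answer's code.

```csharp
using System;
using System.Globalization;

public static class RgbLineParser
{
    // Hypothetical helper (not from the original answer).
    // Returns true only for lines shaped like "<num> <num> <num> SRGB".
    public static bool TryParseRgb(string line, out decimal r, out decimal g, out decimal b)
    {
        r = g = b = 0m;

        // Ordinal comparison avoids culture-sensitive string matching.
        if (!line.EndsWith(" SRGB", StringComparison.Ordinal))
        {
            return false;
        }

        string[] fields = line.Split(' ');

        // Digits with an optional decimal point; no sign, exponent or thousands separator.
        const NumberStyles style = NumberStyles.AllowDecimalPoint;

        // && short-circuits, so parsing stops at the first field that fails.
        return fields.Length == 4
            && decimal.TryParse(fields[0], style, CultureInfo.InvariantCulture, out r)
            && decimal.TryParse(fields[1], style, CultureInfo.InvariantCulture, out g)
            && decimal.TryParse(fields[2], style, CultureInfo.InvariantCulture, out b);
    }
}
```

The black and white filtering from the loop above would still be applied by the caller after a successful parse.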

Upvotes: 0

Michael Liu

Reputation: 55359

The regex will be much, much faster if you rewrite it like this:

Regex regex = new Regex(@"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$");

Note two important changes:

  1. Each . is escaped with a backslash so that the regex matches a literal dot instead of any character.
  2. Each \. and following \d* are optional as a group, rather than \. being optional by itself.

The original regex is slow because \d+.?\d* contains consecutive quantifiers (+, ?, and *). Because the unescaped . can itself match a digit, a run of digits can be divided among \d+, .? and \d* in many different ways, and the regex engine tries them all before giving up. This causes excessive backtracking when the engine attempts to match a line that starts with a long sequence of digits. On my machine, for example, a line containing 10,000 zeroes takes more than four seconds to test. The revised regex takes less than four milliseconds, a 1000x improvement.
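To see the difference between the two patterns, here is a small sketch that holds both side by side. RgbPatterns and MillisecondsToReject are hypothetical names used only for this illustration, and the timings it reports will depend on the machine and the .NET version, so treat it as a way to experiment rather than a benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RgbPatterns
{
    // Pattern from the question: the unescaped '.' matches any character.
    public static readonly Regex Original =
        new Regex(@"^\d+.?\d* \d+.?\d* \d+.?\d* SRGB$");

    // Revised pattern: a literal dot, optional only together with the digits after it.
    public static readonly Regex Revised =
        new Regex(@"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$");

    // Times how long a regex takes to reject a line made only of zeroes.
    // Such a line can never match, because it lacks the spaces and the SRGB suffix.
    public static long MillisecondsToReject(Regex regex, int digitCount)
    {
        string line = new string('0', digitCount);
        Stopwatch stopwatch = Stopwatch.StartNew();
        regex.IsMatch(line);
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Besides speed, the two patterns also differ in what they accept: the original matches a line such as 1x5 2 3 SRGB because its . stands for any character, while the revised pattern rejects it.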

The regex might be even faster (by a hair) if you pass

RegexOptions.Compiled | RegexOptions.ECMAScript

as the second argument to the Regex constructor. ECMAScript tells the regex engine to treat \d as [0-9], ignoring Unicode digits like ༧ (Tibetan 7) which you don't care about.
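As a rough sketch of what that looks like in practice, the class below builds the revised regex with both options and includes two tiny single-digit patterns to show the effect of ECMAScript on \d. DigitOptions and its field names are hypothetical, chosen only for this example.

```csharp
using System.Text.RegularExpressions;

public static class DigitOptions
{
    // Default behaviour: \d matches any Unicode decimal digit.
    public static readonly Regex UnicodeDigit = new Regex(@"^\d$");

    // With ECMAScript, \d is restricted to the ASCII digits 0-9.
    public static readonly Regex AsciiDigit =
        new Regex(@"^\d$", RegexOptions.Compiled | RegexOptions.ECMAScript);

    // The revised line pattern, constructed with both options.
    public static readonly Regex RgbLine =
        new Regex(@"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$",
                  RegexOptions.Compiled | RegexOptions.ECMAScript);
}
```

With these definitions, UnicodeDigit accepts "\u0F27" (Tibetan 7) while AsciiDigit does not, and RgbLine still matches ordinary lines such as 0.922 0.833 0.855 SRGB. Note that Compiled trades a slower construction for faster matching, so it tends to pay off only when the same Regex instance is reused across many lines.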

Upvotes: 3
