Reputation: 10888
Below is the method I have written for reading a text file. While reading, I need to match each line against a given regex and, if it matches, parse the line and add the result to a collection.
    private static void GetOrigionalRGBColours(string txtFile)
    {
        string tempLineValue;
        Regex regex = new Regex(@"^\d+.?\d* \d+.?\d* \d+.?\d* SRGB$");
        using (StreamReader inputReader = new StreamReader(txtFile))
        {
            while (null != (tempLineValue = inputReader.ReadLine()))
            {
                if (regex.Match(tempLineValue).Success
                    && tempLineValue != "1 1 1 SRGB"
                    && tempLineValue != "0 0 0 SRGB")
                {
                    string[] rgbArray = tempLineValue.Split(' ');
                    RGBColour rgbColour = new RGBColour()
                    {
                        Red = Convert.ToDecimal(rgbArray[0]),
                        Green = Convert.ToDecimal(rgbArray[1]),
                        Blue = Convert.ToDecimal(rgbArray[2])
                    };
                    originalColourList.Add(rgbColour);
                }
            }
        }
    }
When this method is run on a 4MB text file with 28653 lines, it takes around 3 minutes to finish. After the run, originalColourList is populated with 582 items.
Can anyone please advise how I can improve the performance of this method? The actual text file size may go up to 60MB.
FYI-
Right Match for Regex: 0.922 0.833 0.855 SRGB
Wrong Match for Regex: /SRGB /setrgbcolor load def
The txt file was originally a PostScript file; I saved it as a txt file so that I can manipulate it using C#.
Upvotes: 1
Views: 5221
Reputation: 2631
Depending on your record format, you can parse the file faster without a regex. I don't know everything about your file, but based on your two examples this is about 30% faster than the optimized regex.
    decimal r;
    decimal g;
    decimal b;
    string rec;
    string[] fields;
    List<RGBColour> originalColourList = new List<RGBColour>();
    using (StreamReader sr = new StreamReader(@"c:\temp\rgb.txt"))
    {
        while (null != (rec = sr.ReadLine()))
        {
            // Cheap suffix test first, so most lines are rejected without splitting.
            if (rec.EndsWith("SRGB"))
            {
                fields = rec.Split(' ');
                if (fields.Length == 4
                    && decimal.TryParse(fields[0], out r)
                    && decimal.TryParse(fields[1], out g)
                    && decimal.TryParse(fields[2], out b)
                    // Skip only pure black (0 0 0) and pure white (1 1 1), as in the question.
                    && !(r == 0 && g == 0 && b == 0)
                    && !(r == 1 && g == 1 && b == 1)
                   )
                {
                    RGBColour rgbColour = new RGBColour() { Red = r, Green = g, Blue = b };
                    originalColourList.Add(rgbColour);
                }
            }
        }
    }
The if short-circuits as soon as any criterion is false, and when every criterion is true the values have already been parsed to decimal, so no separate conversion step is needed. I parsed 6 million lines with this in approximately 12.5 seconds.
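One thing to watch with this approach: decimal.TryParse(string, out decimal) uses the current culture, so on a machine whose culture uses a comma as the decimal separator, "0.922" may not parse the way you expect. A minimal sketch of a culture-independent variant follows. SrgbLineParser and TryParseSrgb are hypothetical names used only for illustration, and the accepted number format is close to, but not identical to, what the regex accepts.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of the answer above.
public static class SrgbLineParser
{
    // Returns true and the three components when the line has the
    // shape "<r> <g> <b> SRGB"; returns false otherwise.
    public static bool TryParseSrgb(string line, out decimal r, out decimal g, out decimal b)
    {
        r = g = b = 0m;

        // Ordinal comparison avoids culture-sensitive string matching.
        if (line == null || !line.EndsWith(" SRGB", StringComparison.Ordinal))
        {
            return false;
        }

        string[] fields = line.Split(' ');

        // AllowDecimalPoint permits digits and a '.', but no sign,
        // whitespace, or thousands separators.
        return fields.Length == 4
            && decimal.TryParse(fields[0], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out r)
            && decimal.TryParse(fields[1], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out g)
            && decimal.TryParse(fields[2], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out b);
    }
}
```

Inside the read loop you would call SrgbLineParser.TryParseSrgb(rec, out r, out g, out b) and then apply the black/white exclusion to r, g and b.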
Upvotes: 0
Reputation: 55359
The regex will be much, much faster if you rewrite it like this:
    Regex regex = new Regex(@"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$");
Note two important changes:
1. The . is escaped with a backslash so that the regex matches a literal dot instead of any character.
2. The \. and the following \d* are optional as a group, rather than \. being optional by itself.
The original regex is slow because \d+.?\d* contains consecutive quantifiers (+, ?, and *). This causes excessive backtracking when the regex engine attempts to match a line that starts with a long sequence of digits. On my machine, for example, a line containing 10,000 zeroes takes more than four seconds to match. The revised regex takes less than four milliseconds, a 1000x improvement.
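If you want to reproduce the comparison yourself, a rough sketch like the one below may help. BacktrackingDemo, TimeMatch and DigitRun are hypothetical names, and the timings you see will depend on your machine and .NET version, so treat the numbers above as indicative only. The sketch also shows a correctness difference: because the unescaped . matches any character, the original pattern accepts a line such as "1x2 3 4 SRGB", while the revised one rejects it.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BacktrackingDemo
{
    // Pattern from the question: unescaped '.', adjacent quantifiers.
    public static readonly Regex Original =
        new Regex(@"^\d+.?\d* \d+.?\d* \d+.?\d* SRGB$");

    // Revised pattern: literal dot, fractional part optional as a group.
    public static readonly Regex Revised =
        new Regex(@"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$");

    // Measures a single IsMatch call; crude, but enough to see the trend.
    public static TimeSpan TimeMatch(Regex regex, string input)
    {
        Stopwatch sw = Stopwatch.StartNew();
        regex.IsMatch(input);
        sw.Stop();
        return sw.Elapsed;
    }

    // Builds a non-matching line made only of zeroes.
    public static string DigitRun(int length)
    {
        return new string('0', length);
    }
}
```

For example, compare BacktrackingDemo.TimeMatch(BacktrackingDemo.Original, BacktrackingDemo.DigitRun(2000)) with the same call using BacktrackingDemo.Revised, then increase the length to see how each pattern scales.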
The regex might be even faster (by a hair) if you pass
    RegexOptions.Compiled | RegexOptions.ECMAScript
as the second argument to the Regex constructor. ECMAScript tells the regex engine to treat \d as [0-9], ignoring Unicode digits like ༧ (Tibetan 7) which you don't care about.
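As a sketch of how that might look, the snippet below builds the revised regex once with both options and adds two tiny patterns that show the difference in how \d is interpreted. The class and field names (SrgbRegex, Line, UnicodeDigit, AsciiDigit) are made up for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder class for illustration only.
public static class SrgbRegex
{
    // Revised pattern, compiled, with \d restricted to [0-9].
    public static readonly Regex Line = new Regex(
        @"^\d+(\.\d*)? \d+(\.\d*)? \d+(\.\d*)? SRGB$",
        RegexOptions.Compiled | RegexOptions.ECMAScript);

    // Default behaviour: \d matches any Unicode decimal digit.
    public static readonly Regex UnicodeDigit = new Regex(@"^\d$");

    // ECMAScript behaviour: \d matches only ASCII 0-9.
    public static readonly Regex AsciiDigit =
        new Regex(@"^\d$", RegexOptions.ECMAScript);
}
```

Here SrgbRegex.UnicodeDigit.IsMatch("\u0F27") is true for the Tibetan digit seven, while SrgbRegex.AsciiDigit.IsMatch("\u0F27") is false. Keeping the regex in a static readonly field also means it is constructed once rather than on every call.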
Upvotes: 3