topwik

Reputation: 3527

How can I optimize the performance of this regular expression?

I'm using a regular expression to replace commas that are not enclosed in text-qualifying quotes with tab characters. I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long. I saw an example of using a regex on file content that looked like this:

String FileContent = ReadFile(FilePath, ErrInfo);        
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");

That replace can understandably take its sweet time on a decent-sized file.

Is there a more efficient way to run this regex? Would it be faster to read the file line by line and run the regex per line?

Upvotes: 3

Views: 447

Answers (3)

Kobi

Reputation: 138017

The problem is the lookahead, which scans all the way to the end of the input on each comma, resulting in O(n^2) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quoted strings while replacing:

Regex csvRegex = new Regex(@"
    (?<Quoted>
        ""                  # Open quotes
        (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
        ""                  # Closing quotes
    )
    |                       # OR
    (?<Comma>,)             # A comma
    ",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
                        match => match.Groups["Comma"].Success ? "\t" : match.Value);

Here we match free commas and quoted strings. The Replace method takes a callback that checks whether the match is a comma: if so, it returns a tab; otherwise it returns the quoted string unchanged.

Upvotes: 4

sehe

Reputation: 393064

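The simplest optimization would be to compile the regex and run it line by line, so that each lookahead only scans to the end of its own line: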

Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
    Console.WriteLine(r.Replace(line, "\t"));

I haven't profiled it, but I wouldn't be surprised if the speedup was huge.

If that's not enough I suggest some manual labour:

// Requires: using System; using System.IO; using System.Text;
using (var input = new StreamReader(File.OpenRead("input.txt")))
{
    char[] toMatch = ",\"".ToCharArray();
    string line;
    while (null != (line = input.ReadLine()))
    {
        var result = new StringBuilder(line);
        bool inquotes = false;

        // Jump from one comma or quote to the next
        for (int index = 0; -1 != (index = line.IndexOfAny(toMatch, index)); index++)
        {
            bool isquote = (line[index] == '"');
            inquotes = inquotes != isquote; // each quote toggles the state

            // Replace only commas found outside quotes
            if (!(isquote || inquotes))
                result[index] = '\t';
        }
        Console.WriteLine(result);
    }
}

PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)

Upvotes: 2

Peter O.

Reputation: 32878

It seems you're trying to convert comma-separated values (CSV) into tab-separated values (TSV).

In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
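As a rough illustration of the library route, here is a minimal sketch using TextFieldParser from the Microsoft.VisualBasic.FileIO namespace, which ships with .NET and needs a reference to the Microsoft.VisualBasic assembly. The class name CsvToTsv and the method name Convert are made up for this example. Note that, unlike the regex approaches, the parser strips the enclosing quotes from each field, and this sketch assumes that no field contains a tab or a line break.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO;

// Hypothetical helper, not part of any library.
public static class CsvToTsv
{
    // Reads CSV from the given reader and returns the same rows joined with tabs.
    // Assumption: fields do not contain tabs or line breaks.
    public static string Convert(TextReader reader)
    {
        var output = new StringBuilder();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                // ReadFields returns the fields of one row with the quotes removed.
                string[] fields = parser.ReadFields();
                output.AppendLine(string.Join("\t", fields));
            }
        }
        return output.ToString();
    }
}
```

To convert a file, something like File.WriteAllText("output.txt", CsvToTsv.Convert(new StreamReader("input.txt"))) should work; the file names here are placeholders.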

Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
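The quote check could look roughly like the sketch below. The names QuoteAwareConverter and ConvertLine are invented for this example, and the fallback pattern is the lookahead from the question, applied only to lines that actually contain a quote. It assumes that a quoted field never spans more than one line.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class QuoteAwareConverter
{
    // Lookahead pattern from the question: a comma followed by balanced quotes up to the end.
    private static readonly Regex CommaOutsideQuotes =
        new Regex(@",(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    public static string ConvertLine(string line)
    {
        // Fast path: with no quotes on the line, every comma is a separator.
        if (line.IndexOf('"') < 0)
            return line.Replace(',', '\t');

        // Slow path: only lines with quotes pay for the regex.
        return CommaOutsideQuotes.Replace(line, "\t");
    }
}
```

For example, ConvertLine("a,b,c") takes the fast path, while ConvertLine("a,\"b,c\",d") falls back to the regex and leaves the comma inside the quotes untouched. How much this helps depends on how many lines contain quotes.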

Upvotes: 6
