WSkid

Reputation: 2786

C# Multiple Regex Replaces on String - Too Much Memory

Basically what I would like to do is run multiple (15-25) regex replaces on a single string with the best possible memory management.

Overview: The program streams a text-only file (sometimes HTML) via FTP, appending to a StringBuilder to build one very large string. The file size ranges from 300KB to 30MB.

The regular expressions are semi-complex and need to span multiple lines of the file (identifying sections of a book, for example), so arbitrarily breaking the string, or running the replace on every download loop, is out of the question.

A sample replace:

Regex re = new Regex("<A.*?>Table of Contents</A>", RegexOptions.IgnoreCase);
source = re.Replace(source, "");

With each replace the memory usage skyrockets. I know this is because strings are immutable in C# and each replace has to make a copy, but even calling GC.Collect() doesn't help enough for a 30MB file.

Any advice on a better approach, or on a way to perform multiple regex replaces in constant memory (make 2 copies (so 60MB in memory), perform the search, discard a copy to get back to 30MB)?

Update:

There does not appear to be a simple answer, but for future readers: I ended up using a combination of all the answers below to get it to an acceptable state:

  1. If possible, split the string into chunks. See manojlds's answer for a way to do that as the file is being read, looking for suitable end points.

  2. If you can't split it as it streams, at least split it later if possible. See ChrisWue's answer for some external tools that may help by piping the text through to files.

  3. Optimize the regex: avoid lazy (non-greedy) quantifiers and limit what the engine has to do as much as possible. See Sylverdrag's answer.

  4. Combine the regexes when possible. This cuts down the number of replaces when the regexes do not depend on each other (useful in this case for cleaning bad input). See Brian Reichle's answer for a code sample.

Thank you all!

Upvotes: 8

Views: 3581

Answers (4)

Sylver

Reputation: 8967

I have a fairly similar situation.

Use the compile option for the regex:

source = Regex.Replace(source, pattern, replace, RegexOptions.Compiled);

Depending on your situation, it can make a major difference in speed.
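As a rough sketch of the same idea, the pattern can be built once as a compiled Regex instance and reused, rather than passing the options to the static method on every call. The class and method names here (Cleaner, RemoveTocLinks) are made up for illustration; the pattern is the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class Cleaner
{
    // Built once and reused for every file. RegexOptions.Compiled costs extra
    // time at construction, so it only pays off when the regex runs many times
    // or over a lot of text.
    private static readonly Regex TocLink = new Regex(
        "<A.*?>Table of Contents</A>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns a copy of source with every matching link removed.
    public static string RemoveTocLinks(string source)
    {
        return TocLink.Replace(source, "");
    }
}
```

Holding on to instances may matter with 15-25 patterns: as far as I know the static Regex methods only cache a limited number of parsed patterns (Regex.CacheSize, 15 by default), so some of them could end up being rebuilt on every call. Note that this helps speed only; each Replace still allocates a new string.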

It is not a complete solution, though, especially for files larger than 3-4 MB.

If you get to decide which regexes are run (not my case), you should probably optimize them as much as possible and avoid the costly operations. For instance, avoid lazy (ungreedy) quantifiers, and avoid lookaheads and lookbehinds.

Instead of using:

<a.*?>xxx

use

<a[^<>]*>xxx

The reason is that a lazy quantifier makes the regex engine try the rest of the expression after every single character it consumes, whereas [^<>]* only has to compare the current character against < and > and stops as soon as it reaches one. On a large file, this can make the difference between half a second and an application freeze.
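A small sketch to compare the two forms, using the "Table of Contents" link from the question (the TagPatterns class name is just for illustration). Besides speed, the two patterns do not always match the same text: the lazy version is free to start at an earlier tag on the same line and run across it.

```csharp
using System.Text.RegularExpressions;

public static class TagPatterns
{
    // Lazy form: ".*?" grows one character at a time and retries the rest of
    // the pattern after each step. It may cross '>' and '<' while doing so.
    public static readonly Regex Lazy = new Regex(
        "<a.*?>Table of Contents</a>", RegexOptions.IgnoreCase);

    // Negated-class form: "[^<>]*" can never run past the end of the tag.
    public static readonly Regex Negated = new Regex(
        "<a[^<>]*>Table of Contents</a>", RegexOptions.IgnoreCase);
}
```

For the input `<a href="x">Home</a> <a href="t">Table of Contents</a>`, TagPatterns.Lazy.Replace(input, "") removes both links and returns an empty string, while TagPatterns.Negated.Replace(input, "") leaves `<a href="x">Home</a> ` in place. One more difference to keep in mind: unlike `.`, the negated class also matches newlines, so it handles tags whose attributes wrap onto another line.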

It doesn't totally solve the problem, but it should help.

Upvotes: 2

ChrisWue

Reputation: 19020

Assuming that the documents you load have some kind of structure, you might be better off writing a parser that puts the document into a structured form, breaking the large string into multiple chunks, and then operating on that structure.

One problem with large strings is that objects of 85,000 bytes or more are considered large objects and are put on the large object heap, which is not compacted by default, so fragmentation can lead to unexpected out-of-memory situations.
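A minimal sketch of the chunking idea, assuming the text can be cut at section boundaries. The boundary test (a line starting with "<h1") and the names ChunkedCleaner, Clean, and StartsSection are hypothetical; the real rule depends on the documents. Since .NET strings are UTF-16, a chunk would need to stay under roughly 42,000 characters to remain off the large object heap.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class ChunkedCleaner
{
    // Hypothetical boundary rule: a new section starts on a line beginning with "<h1".
    private static bool StartsSection(string line)
    {
        return line.StartsWith("<h1", StringComparison.OrdinalIgnoreCase);
    }

    // Reads the input one section at a time, applies every pattern to that
    // section only, and writes the cleaned section to the output.
    public static void Clean(TextReader input, TextWriter output, Regex[] patterns)
    {
        var chunk = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (StartsSection(line) && chunk.Length > 0)
            {
                Flush(chunk, output, patterns);
            }
            // Line endings are normalized to "\n" here.
            chunk.Append(line).Append('\n');
        }
        if (chunk.Length > 0)
        {
            Flush(chunk, output, patterns);
        }
    }

    private static void Flush(StringBuilder chunk, TextWriter output, Regex[] patterns)
    {
        string text = chunk.ToString();
        foreach (Regex re in patterns)
        {
            text = re.Replace(text, ""); // each match is removed
        }
        output.Write(text);
        chunk.Clear();
    }
}
```

With this shape the temporary copies made by each Replace are the size of one section rather than the whole file. It only works if no pattern needs to match across a section boundary.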

Another option would be to pipe it through an external tool like sed or awk.

Upvotes: 1

Brian Reichle

Reputation: 2856

Depending on the nature of the regexes, you might be able to combine them into a single regular expression and use the overload of Replace() that takes a MatchEvaluator delegate to determine the replacement from the matched string.

// One combined pattern; the alternatives are tried left to right at each position.
Regex re = new Regex("First Pattern|Second Pattern|Super(Mega)*Delux", RegexOptions.IgnoreCase);

source = re.Replace(source, delegate(Match m)
{
    string value = m.Value;

    // Choose the replacement based on the text that was matched.
    if (value.Equals("first pattern", StringComparison.OrdinalIgnoreCase))
    {
        return "1st";
    }
    else if (value.Equals("second pattern", StringComparison.OrdinalIgnoreCase))
    {
        return "2nd";
    }
    else
    {
        // Anything else (the Super...Delux alternative) is removed.
        return "";
    }
});

Of course, this falls apart if later patterns need to be able to match on the result of earlier replacements.
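Comparing m.Value against literals only works while the alternatives are fixed strings. For real patterns, one possible variation (a sketch, not something I have run against your data) is to wrap each alternative in a named group and check which group succeeded. The group names and the CombinedReplacer class are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class CombinedReplacer
{
    // Each alternative sits in its own named group so the evaluator can tell
    // which one matched without re-inspecting the matched text.
    private static readonly Regex Combined = new Regex(
        "(?<first>First Pattern)|(?<second>Second Pattern)|(?<delux>Super(?:Mega)*Delux)",
        RegexOptions.IgnoreCase);

    // Performs all three replacements in a single pass over source.
    public static string Apply(string source)
    {
        return Combined.Replace(source, delegate(Match m)
        {
            if (m.Groups["first"].Success) return "1st";
            if (m.Groups["second"].Success) return "2nd";
            return ""; // the "delux" alternative is removed
        });
    }
}
```

For example, CombinedReplacer.Apply("A first pattern, a SECOND PATTERN, and SuperMegaMegaDelux.") returns "A 1st, a 2nd, and .". The whole string is scanned once and only one new copy is produced, instead of one per pattern.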

Upvotes: 2

manojlds

Reputation: 301127

Have a look at this article, which covers searching a stream with regular expressions instead of first loading everything into a string, which is what consumes the memory:

http://www.developer.com/design/article.php/3719741/Building-a-Regular-Expression-Stream-Search-with-the-NET-Framework.htm

Upvotes: 2
