Vajura
Vajura

Reputation: 1132

Is this possible in regex to speed up performance?

Its a more general question so it doesn't really matter what regex expression I use but here is my current one:

(?<name>[^ ^\n]*)[ \n]+OBJECT IDENTIFIER(?<data>([^"]*"[^"]*?")*?[^"]*?)::=[ \n]*\{[ \n]*(?<parent>[^ ^\n]*) (?<oid>\w*)

The main thing in this regex is the OBJECT IDENTIFIER keyword, can I make the regex search for this first ignoring the (?<name>[^ ^\n]*)[ \n]+ in front and then after the regex found OBJECT IDENTIFIER it should apply the whole expression to that location.

Upvotes: 0

Views: 321

Answers (1)

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89547

edit:

I thought that with this kind of pattern the transmission will find the fixed string position and after that the regex engine will search backward the begining of the pattern, but it can't do that. It is not as smart as I thought.

In this case, you can exploit "the first string discrimination" to speed up the recognition writing something like this:

fixed string(?<=non-fixed subpattern, fixed string)

where the fixed string is in the first position in the pattern and, accordingly, allows the transmission to use the Boyer-Moore algorithm to find the position of the fixed string.

or you can try:

(?<=non-fixed subpattern)fixed string 

I can't test it, but it may be possible that the simple fact to put the "non-fixed subpattern" inside a lookbehind allows the transmission to choose what must be the best way to find the match position between testing the lookbehind or finding the fixed string. But I don't know if the transmission is smart enough to do that, it's only a supposition.

As Qtax notices it in a comment, the subpattern ([^"]*"[^"]*?")*?[^"]*? is potentially slow, because you use a lazy quantifier (the one for the group) that precedes [^"]*? that can match the same thing than the begining of the group and that can give an empty match.

Instead of this subpattern, you can use the ::= that comes after and write something like: [^:]+(?>:(?!:=)[^:]*)* that doesn't cause backtracking and use only greedy quantifiers. Note: if you really need to skip ::= that are between double quotes, you can use this: [^:"]+(?>(?::(?!:=)|"[^"]*")[^:]*)*.

An other small optimisation: don't use capturing group when you don't need to capture something, use non-capturing groups instead or atomic groups when it is better and possible.

Conclusion, you can test these patterns (written for the free-spacing mode):

const string Pattern1 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg1 = new Regex(Pattern1, RegexOptions.IgnorePatternWhitespace);

const string Pattern2 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg2 = new Regex(Pattern2, RegexOptions.IgnorePatternWhitespace);

const string Pattern3 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:"]+(?> (?: :(?!:=) | "" [^""]* "" ) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg3 = new Regex(Pattern3, RegexOptions.IgnorePatternWhitespace);

const string Pattern4 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:"]+ (?> (?: :(?!:=) | "" [^""]* "" ) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg4 = new Regex(Pattern4, RegexOptions.IgnorePatternWhitespace);

Note: Since I haven't your data under the eyes, I assumed that each named captures can not be empty. If it is not the case, you only need to change some + quantifiers to *.

Note2: It can be interesting if you try these patterns with and without the compiled regex option if you need to use the pattern several times in your code.

old answer:

You don't need to do that because there is a pre-analysis phase before the regex engine work itself.

This phase is called "transmission" and consists of several optimizations. One of these optimizations consists of finding fixed strings from the pattern in the target string first, using the Boyer-Moore algorithm to reduce the regex engine work.

Upvotes: 2

Related Questions