VinS

Reputation: 194

How to emulate a negative lookbehind assertion in a Qt regexp?

I am writing a program using Qt 4.6 and I need to capture all occurrences of non-range literals in expressions like "SUM(A1:A3)+B1-B3+SUM(D1:D3)/COUNT(D1:D3)", i.e. B1 and B3, but not A1, A3, D1 or D3. I have tried QRegExp, but it doesn't support negative lookbehind assertions, so I can't exclude literals like A3 and D3. My regexp (?<!:)([A-Z]{1,4}[1-9]\\d{0,3})(?!:) doesn't work. I need your input. Thanks.

Upvotes: 2

Views: 2490

Answers (3)

Becheru Petru-Ioan

Reputation: 321

QRegularExpression in Qt 5.0 supports (fixed length) lookbehind assertions.

https://bugreports.qt-project.org/browse/QTBUG-2371 was closed on Mar 22, 2012. Qt 5.0 was released on Dec 19, 2012.

Upvotes: 0

Andrew Cheong

Reputation: 30273

For solutions that require lookarounds on engines that don't support lookarounds, I have found only one alternative: "combinatoric brute-forcing" as I call it, though I'm sure there's a more technical name. One example is here: Validate proxy URL using XML regex pattern.

But that technique doesn't work when you need to find more than one occurrence. You've probably tried something like this yourself:

/[^:]\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:]/

(I added the \b to be safer. Also, remember to double the backslashes when you put this in a C++ string literal.)

If you did try this, you will have noticed the problem: the first match is found after reading up to +B1-. Since the - has already been consumed, the next cell reference B3 cannot be matched, because there is no character left in front of it for [^:] to match.

Put another way, the above regex can only catch every other match in a consecutive chain of cell references, e.g. for the string,

(A1+A2+A3+A4+A5+A6)/(B1+B2+B3+B4+B5+B6)
^^^^  ^^^^  ^^^^    ^^^^  ^^^^  ^^^^

...only the indicated parts will match, also shown here.
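If you want to reproduce this, here is a minimal sketch. It assumes std::regex (ECMAScript grammar) as a stand-in for QRegExp, since it also has lookahead but no lookbehind; consumingMatches is just a hypothetical helper name, not anything from Qt.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns capture group 1 of every match of the
// consuming pattern. std::regex is an assumed stand-in for QRegExp.
std::vector<std::string> consumingMatches(const std::string& text) {
    // The [^:] on each side consumes one neighbouring character, so the
    // separator between two adjacent references can only be used once.
    static const std::regex re(R"re([^:]\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:])re");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

Calling consumingMatches("(A1+A2+A3+A4+A5+A6)/(B1+B2+B3+B4+B5+B6)") should give A1, A3, A5, B1, B3, B5, which is the every-other-match behaviour marked above.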

There is no way to get around this in a single regex. Your alternatives:

  1. Use a non-regex approach.

  2. If you must use regex for some reason, then probably your only hope is to be able to use at least two regexes (e.g. use the first to insert spaces around all cell-reference-like strings, so that you have no consecutive chains of cell references).

  3. Or, in the improbable case that it's good enough to capture them into submatches, i.e. accessible via .cap(1), .cap(2), etc. you may be able to do the following.


/[^:]\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:](?:(\b([A-Z]{1,4}[1-9]\d{0,3})\b[^:]))?))?))?))?))?))?/

Well, that's impossible to read, so here's a more readable version. Pretend XY expands to our cell reference expression, \b([A-Z]{1,4}[1-9]\d{0,3})\b. Then, the above is the same as:

/[^:]XY[^:](?:(XY[^:](?:(XY[^:](?:(XY[^:](?:(XY[^:](?:(XY[^:](?:(XY[^:]))?))?))?))?))?))?/

See the pattern? You can check that this regex matches our example perfectly. The drawback is that it only handles a chain of consecutive cell references as long as the number of repetitions you write into the regex. The above can handle 7, and beyond that it breaks.
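For option 2, here is one possible two-regex sketch. Note that it is a variation, not the space-inserting idea mentioned above: it assumes it is acceptable to blank out whole ranges in a first pass and then collect whatever cell references remain. It again uses std::regex as an assumed stand-in for QRegExp, and nonRangeRefsTwoPass is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: two passes, no lookbehind needed.
// Pass 1 replaces every range such as A1:A3 with a space,
// pass 2 collects the cell references that are left.
std::vector<std::string> nonRangeRefsTwoPass(const std::string& text) {
    static const std::regex range(
        R"re(\b[A-Z]{1,4}[1-9]\d{0,3}:[A-Z]{1,4}[1-9]\d{0,3}\b)re");
    static const std::regex cell(R"re(\b([A-Z]{1,4}[1-9]\d{0,3})\b)re");

    // A space (rather than nothing) keeps the neighbouring tokens apart.
    const std::string stripped = std::regex_replace(text, range, " ");

    std::vector<std::string> found;
    for (std::sregex_iterator it(stripped.begin(), stripped.end(), cell), end;
         it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

Because the second pattern consumes nothing around the reference, chains of any length such as A1+A2+A3 are handled without nesting.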

Upvotes: 0

Ferdinand Beyer

Reputation: 67137

In your case you could use

(?:^|[^:])\b([A-Z]{1,4}[1-9]\d{0,3})\b(?!:)

The first group matches the empty string at the beginning or any character except the colon. I also added word boundaries \b so that the pattern won't match things like A4a.
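A minimal sketch of how this pattern behaves, assuming std::regex (ECMAScript grammar) as a stand-in for QRegExp, since both offer lookahead but not lookbehind. The helper name nonRangeRefs is hypothetical; with QRegExp you would loop with indexIn() and read cap(1) instead.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects cell references that are not part of a range.
// std::regex is an assumed stand-in for QRegExp.
std::vector<std::string> nonRangeRefs(const std::string& text) {
    // (?:^|[^:]) emulates the negative lookbehind by consuming one character
    // (or nothing at the start); (?!:) is a real lookahead and consumes
    // nothing, so the separator stays available for the next match.
    static const std::regex re(
        R"re((?:^|[^:])\b([A-Z]{1,4}[1-9]\d{0,3})\b(?!:))re");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back((*it)[1].str());  // group 1 is the reference itself
    }
    return found;
}
```

For the expression from the question, nonRangeRefs("SUM(A1:A3)+B1-B3+SUM(D1:D3)/COUNT(D1:D3)") should return B1 and B3. Remember that the whole match may include the leading character, so always read the capture group rather than the full match.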

Often it is simpler to write "positive" patterns. For example, using

(...)(:...)?

with ... denoting your [A-Z] pattern to match cell references, you can match ranges and non-ranges in one pass, then discard all ranges when looping over the results. You can easily detect whether a match is a range by checking if the second capture group is empty.
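A sketch of this match-then-filter idea, again assuming std::regex as a stand-in for QRegExp and with ... expanded to the cell reference pattern from the question. The name nonRangeRefsPositive is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: matches both single references and ranges,
// then keeps only the matches whose optional ":..." part is empty.
std::vector<std::string> nonRangeRefsPositive(const std::string& text) {
    // Group 1: the first cell reference. Group 2: optional ":<cell reference>".
    static const std::regex re(
        R"re(\b([A-Z]{1,4}[1-9]\d{0,3})\b(:[A-Z]{1,4}[1-9]\d{0,3}\b)?)re");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m[2].length() == 0) {  // empty second group: not a range
            found.push_back(m[1].str());
        }
    }
    return found;
}
```

Since a range is consumed as one match, its second endpoint is never seen on its own, which is what makes the lookbehind unnecessary here.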

Upvotes: 2
