Dmitry Kuzminov

Reputation: 6594

Why doesn't (...)? regular expression capture the string?

I have code where a QString is modified using a regular expression:

QString str; // str is the string that shall be modified
QString pattern, after; // pattern and after are parameters provided as arguments

str.replace(QRegularExpression(pattern), after);

Whenever I need to append something to the end of the string I use the arguments:

QString pattern("$");
QString after("ending");

Now I have a case where the same pattern is applied twice, but it should append the string only once. I expected the following to work (I assume that the initial string doesn't end with "ending"):

QString pattern("(ending)?$");
QString after("ending");

But when applied twice, this pattern produces a doubled ending: "<initial string>endingending".

It looks like the ()? expression is lazy: it captures the expression in parentheses only if I force it with a preceding sub-expression:

QString pattern("string(ending)?$");
QString after("ending");

QString str("Init string");
str.replace(QRegularExpression(pattern), after);
// str == "Init ending"

What's wrong with the "()?" construction (why is it lazy), and how can I achieve my goal?

I'm using Qt 5.14.0 (due to some dependencies I cannot use Qt6).

Upvotes: 2

Views: 152

Answers (3)

Marek R

Reputation: 38181

OK, I have an explanation for why this happens (so the question in the title is answered).

Basically, QString::replace (like std::regex_replace) finds two matches and performs two replacements: one where the capture group matches ending, and a second where the capture group matches nothing and only the end of the string is matched.

This happens because $ is just an assertion. It doesn't consume any character, so it can match again when the index-based search resumes after the first match (as it does in a replace-all).

Here is a demo in plain C++ that illustrates the issue:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string s;
    auto r = std::regex{"(ending)?$"};
    auto after = "ending";
    while(getline(std::cin, s)) {
        std::cout << "s: " << s << '\n';
        std::cout << "replace: " << std::regex_replace(s, r, after) << '\n';
        for (auto i = std::regex_iterator{s.begin(), s.end(), r};
            i != decltype(i){};
            ++i) {
            std::cout << "found: " << i->str() << " capture: " << i->str(1);
            std::cout << '\n';
        }
        std::cout << "------------\n";
    }

    return 0;
}

https://godbolt.org/z/e85ajb9aP

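Now, knowing the root cause, you can try to address the issue.

I came up with this: match the whole string using two capture groups, the first one non-greedy, and reuse that first group in the replacement. Pattern: ^(.*?)(ending)?$ and replacement: $1ending (in QString::replace the back-reference in after is written \1 rather than $1).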

https://godbolt.org/z/EKz4fEaT7
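
As a rough sketch of that idea in plain C++ (std::regex rather than QRegularExpression; the name appendEndingOnce is made up for illustration), the replacement can be wrapped in a helper. It assumes a single-line string, since . does not match line terminators in the default ECMAScript grammar:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: appends "ending" unless the string already ends with it.
// The lazy group 1 takes everything before an optional trailing "ending".
// Anchoring with ^ keeps the replace-all pass from finding a second,
// empty match at the end of the string.
std::string appendEndingOnce(const std::string& s)
{
    static const std::regex re{"^(.*?)(ending)?$"};
    return std::regex_replace(s, re, "$1ending");
}
```

With this helper, appendEndingOnce("Init string") and appendEndingOnce("Init stringending") both give "Init stringending", so applying it twice is harmless.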

Upvotes: 0

peppe

Reputation: 22826

A pattern like (foo)?$ matches twice at the end of a string ending with foo. You can easily see this in action in Perl or at https://regex101.com/r/3Oqwo1/1:

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'

Matched |foo| from 3 to 6
Matched || from 6 to 6

Therefore you'll do two substitutions at the end, defeating your purpose.

One way to see this is that patterns match between characters:

            /-----------\
            v           v   first pattern matches here
| a | b | c | f | o | o |
                       ^ ^
                       \-/  second pattern matches here

If the "tail" is fixed-length, you can use a negative lookbehind, as already suggested: (?<!foo)$.

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
# no match
$ perl -E '$_ = "abcfie"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6

Note that there's no .* before, nor ? after, the negative lookbehind. If you add either of them, you'll break the matching again:

$ perl -E '$_ = "abcfie"; while ($_ =~ /.*(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched |abcfie| from 0 to 6
Matched || from 6 to 6

Global matching will happen twice in abcfie, once matching the entire string, and again matching the empty string at the end (look at the offsets). This will result in 2 replacements.
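
The same double match can be reproduced in plain C++. This is only a sketch: std::regex's default ECMAScript grammar has no lookbehind, so it uses the simpler pattern .*$ instead of the one above, and countMatches is a hypothetical helper name:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: counts the matches a global (replace-all) pass
// would visit for the given pattern.
std::size_t countMatches(const std::string& text, const std::string& pattern)
{
    const std::regex re{pattern};
    std::size_t n = 0;
    for (auto it = std::sregex_iterator{text.begin(), text.end(), re};
         it != std::sregex_iterator{}; ++it) {
        ++n;  // one per match, including empty matches
    }
    return n;
}
```

Here countMatches("abcfie", ".*$") reports two matches (the whole string, then the empty string at the end), which is why a replace-all inserts the replacement twice.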

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6

This will match at the very end of the string, resulting in a replacement that you don't want (string already ends in foo).
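
The lookbehind pattern cannot be tried directly with std::regex, because its default ECMAScript grammar supports lookahead but not lookbehind (QRegularExpression does support it). A possible stand-in for plain C++, which is an assumption of this sketch and not the pattern above, is a negative lookahead anchored at the start of the string. appendFooOnce is a hypothetical name, and single-line input is assumed:

```cpp
#include <regex>
#include <string>

// Hypothetical helper emulating (?<!foo)$ without lookbehind.
// ^(?!.*foo$) rejects strings that already end with "foo"; otherwise
// group 1 captures the whole string and "foo" is appended after it.
// The ^ anchor prevents a second, empty match at the end.
std::string appendFooOnce(const std::string& s)
{
    static const std::regex re{"^(?!.*foo$)(.*)$"};
    return std::regex_replace(s, re, "$1foo");
}
```

With this helper, appendFooOnce("abcfie") gives "abcfiefoo", while appendFooOnce("abcfoo") leaves the string unchanged.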

Upvotes: 1

Daksh

Reputation: 489

The ? in your regex tells the regex engine that the string can optionally end with ending. Your question is a bit unclear, but if I understand it correctly, what you need instead is a negative lookbehind. Changing your pattern as follows should do the trick:

QString pattern(".*(?<!ending)$");

This makes sure that it only matches strings that don't originally end with ending. You can play with it here.

Upvotes: 0
