Julien

Reputation: 5779

std::regex_replace gives me unexpected result

I'm using std::regex_replace in a C++ Windows project (Visual Studio 2010). The code looks like this:

std::string str("http://www.wikipedia.org/");
std::regex fromRegex("http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::string fmt("https://$1wik$2.org/");
std::string result = std::regex_replace(str, fromRegex, fmt);

I would expect result to be "https://www.wikipedia.org/", but I get "https://www.wikipedia.wikipedia.org/".
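One way to narrow this down is to look at the capture groups directly instead of only at the replaced string. The reported output is exactly what the format string produces if $1 holds "www.wikipedia." rather than "www." (with $2 as "ipedia"); that is an inference from the output, not a confirmed diagnosis. The sketch below uses the same pattern and format string as above. The helper names captureGroup and rewriteUrl are made up for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of the original post): returns the text of
// capture group `index` for the first match of the question's pattern in
// `input`, or an empty string if there is no match or no such group.
std::string captureGroup(const std::string& input, std::size_t index) {
    const std::regex fromRegex("http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/",
                               std::regex_constants::icase);
    std::smatch match;
    if (!std::regex_search(input, match, fromRegex) || index >= match.size()) {
        return std::string();
    }
    return match[index].str();
}

// Hypothetical helper: the same replacement call as in the question.
std::string rewriteUrl(const std::string& input) {
    const std::regex fromRegex("http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/",
                               std::regex_constants::icase);
    const std::string fmt("https://$1wik$2.org/");
    return std::regex_replace(input, fromRegex, fmt);
}
```

Calling captureGroup("http://www.wikipedia.org/", 1) should give "www." on a conforming implementation. If it gives "www.wikipedia." instead, the problem is in the matcher's capture bookkeeping rather than in std::regex_replace or its flags.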

A quick check with sed gives me the expected result:

$ cat > test.txt
http://www.wikipedia.org/
$ sed -E 's/http:\/\/([^@:\/]+\.)?wik(ipedia|imedia)\.org\//https:\/\/\1wik\2.org\//' test.txt
https://www.wikipedia.org/

I don't understand where the difference comes from. I checked the flags that can be passed to std::regex_replace, but I didn't see one that would help in this case.

Update

These variants work fine:

std::regex fromRegex("http://([^@:/]+\\.)wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://((?:[^@:/]+\\.)?)wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([a-z]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^a]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);

but not these:

std::regex fromRegex("http://([^1-9]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^@]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^:]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);

It makes no sense to me...
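To compare implementations, the original pattern and the seven variants above can be run through one small harness. This is only a sketch: kPatterns, kPatternCount, and rewriteWith are names invented here. On a conforming implementation I would expect every pattern to produce the same result, so any pattern that differs points at the library rather than at the regex.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// The original pattern followed by the seven variants from the update.
const char* const kPatterns[] = {
    "http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/",      // original
    "http://([^@:/]+\\.)wik(ipedia|imedia)\\.org/",       // group not optional
    "http://((?:[^@:/]+\\.)?)wik(ipedia|imedia)\\.org/",  // optional part nested
    "http://([a-z]+\\.)?wik(ipedia|imedia)\\.org/",
    "http://([^a]+\\.)?wik(ipedia|imedia)\\.org/",
    "http://([^1-9]+\\.)?wik(ipedia|imedia)\\.org/",
    "http://([^@]+\\.)?wik(ipedia|imedia)\\.org/",
    "http://([^:]+\\.)?wik(ipedia|imedia)\\.org/"
};
const std::size_t kPatternCount = sizeof(kPatterns) / sizeof(kPatterns[0]);

// Hypothetical helper: applies the question's format string using `pattern`
// against the question's test URL and returns the rewritten string.
std::string rewriteWith(const std::string& pattern) {
    const std::string str("http://www.wikipedia.org/");
    const std::regex fromRegex(pattern, std::regex_constants::icase);
    const std::string fmt("https://$1wik$2.org/");
    return std::regex_replace(str, fromRegex, fmt);
}
```

Looping over kPatterns and printing rewriteWith(kPatterns[i]) for each index gives a quick table of which patterns misbehave on a given compiler and library version.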

Upvotes: 2

Views: 646

Answers (1)

Pete Becker

Reputation: 76315

There's a subtle error in the regular expression. Don't forget that escape sequences in string literals are expanded by the compiler. So change

"http://([^@:/]+\.)?wik(ipedia|imedia)\.org/"

to

"http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/"

That is, replace each of the two single backslashes with a pair of backslashes.

EDIT: this doesn't seem to affect the problem, though. On the two implementations I tried (Microsoft and clang), the original problem doesn't occur, with or without the doubled backslashes. (Without them, you get compiler warnings about an invalid escape sequence, but the resulting . wildcard matches the . character in the target sequence, just as a \. would.)
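As a small illustration of the escaping point, here is a sketch that assumes a compiler with C++11 raw string literals (Visual Studio 2010 does not have them). A raw literal lets you write the regex backslashes once, and it produces the same string as the doubled-backslash form. The function names are made up for this example.

```cpp
#include <regex>
#include <string>

// Ordinary literal: each regex backslash has to be doubled for the compiler.
std::string escapedPattern() {
    return "http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/";
}

// Raw literal (C++11): backslashes reach the regex engine unchanged.
std::string rawPattern() {
    return R"(http://([^@:/]+\.)?wik(ipedia|imedia)\.org/)";
}

// Hypothetical helper: true if `pattern` matches the whole of `text`.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

With these, matchesWhole("wikipedia\\.org", "wikipediaXorg") is false while matchesWhole("wikipedia.org", "wikipediaXorg") is true, which is why the unescaped dot still matches the target here: it is only looser than \., not wrong for this input.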

Upvotes: 3
