Reputation: 1414
Good morning, We are trying to match the German string 'DAS tausendschöne Jungfräulein tausendschçne' using the C/C++ PCRE regex "\x{00F6}.*\x{00E4}.*\x{00E7}". The PCRE regex matches only once, and the reported byte positions are 14 and 43. Is our PCRE regex correct, or should it be changed? Thank you.
Upvotes: 0
Views: 173
Reputation: 1414
Good afternoon, We just discovered the correct PCRE regular expression: (?=.+(\x{00F6})){1}(?=.+(\x{00E4})){1}(?=.+(\x{00E7})){1}
It matches DAS tausendschöne Jungfräulein ausendschçne at byte positions (14,16), (25,27) and (42,43). Regards, Frank
Upvotes: 0
Reputation: 35643
You have misunderstood the returned data.
PCRE returns the starting and ending positions of the match. It has matched only once in each case, but the reported range covers the whole matched text, including the parts matched by "boring" things like .*
So for your input string it has matched these parts:
DAS tausendschöne Jungfräulein tausendschçne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmm..
Or equivalently it has matched these bytes:
0         1         2         3         4        4
01234567890123456789012345678901234567890123456789
DAS tausendschöne Jungfräulein tausendschçne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
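The two diagrams differ because ö, ä and ç each occupy two bytes when the subject is encoded as UTF-8, so byte offsets run ahead of character positions after each of them. As a rough sketch (assuming valid UTF-8 input; utf8_char_index is a hypothetical helper name, not part of PCRE), a byte offset can be converted to a character index in C like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
 * Returns the number of UTF-8 characters that start before byte_offset.
 * Assumes s is valid UTF-8 and byte_offset does not exceed strlen(s). */
size_t utf8_char_index(const char *s, size_t byte_offset)
{
    size_t chars = 0;

    assert(s != NULL);
    for (size_t i = 0; i < byte_offset; i++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the subject string from the question, the bytes where ö, ä and ç start (14, 25 and 43) correspond to character indices 14, 24 and 41.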
It is behaving correctly. From http://www.pcre.org/pcre.txt :
When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on.
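To make the pair layout concrete, here is a minimal sketch of reading one pair out of ovector. It assumes ovector and rc come from a successful pcre_exec() call (rc being the number of pairs that were set); copy_capture is a hypothetical helper name, not part of the PCRE API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function.
 * Copies the text described by pair n of ovector into out (NUL-terminated).
 * Pair 0 is the whole match; pair 1 is the first capturing subpattern, etc.
 * Returns the length in bytes, or -1 if the pair is out of range, unset
 * (offsets of -1), or too long for the output buffer. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *out, size_t out_size)
{
    assert(subject != NULL && ovector != NULL && out != NULL);

    if (n < 0 || n >= rc)
        return -1;

    int start = ovector[2 * n];      /* byte offset of first character */
    int end   = ovector[2 * n + 1];  /* byte offset just past the end  */
    if (start < 0 || end < start)
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

With the first regex from the question there are no capturing subpatterns, so only pair 0 is set, and it spans everything from the ö through the ç. The individual positions of ö, ä and ç would only show up as pairs 1, 2 and 3 if each were wrapped in its own capturing group.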
Upvotes: 1