Reputation: 1414

Is it possible to construct a PCRE UTF-8 regex which matches 3 or more non-consecutive UTF-8 code points?

Good morning. We are trying to match the German string 'DAS tausendschöne Jungfräulein tausendschçne' using the C/C++ PCRE regex "\x{00F6}.*\x{00E4}.*\x{00E7}". The regex matches only once, reporting byte positions 14 and 43. Is our PCRE regex correct, or should it be changed? Thank you.

Upvotes: 0

Answers (2)

Frank

Reputation: 1414

Good afternoon. We just discovered the correct PCRE regular expression:

(?=.+(\x{00F6})){1}(?=.+(\x{00E4})){1}(?=.+(\x{00E7})){1}

It matches 'DAS tausendschöne Jungfräulein ausendschçne' at byte positions (14,16), (25,27) and (42,43). Regards, Frank
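For readers who want to see what a lookahead pattern of this shape does, here is a minimal sketch in Python's `re` module. It is a stand-in for PCRE, not the code used above. The `{1}` quantifiers are left out on the assumption that they do not change the result, the offsets are character offsets rather than byte offsets, and the `reordered` string is made up for illustration. It suggests two things: the lookaheads consume nothing, so the overall match is empty and the positions come from the capture groups, and the three code points are accepted in any order.

```python
import re

# 'DAS tausendschöne Jungfräulein tausendschçne', written with escapes
subject = "DAS tausendsch\u00f6ne Jungfr\u00e4ulein tausendsch\u00e7ne"

# Three independent lookaheads, each capturing one code point
lookahead = "(?=.+(\u00F6))(?=.+(\u00E4))(?=.+(\u00E7))"

m = re.search(lookahead, subject)
whole = m.span(0)                        # (0, 0): a zero-width match at the start
groups = [m.span(i) for i in (1, 2, 3)]  # character offsets of ö, ä, ç

# Hypothetical subject with the three code points in the opposite order
reordered = "tausendsch\u00e7ne Jungfr\u00e4ulein tausendsch\u00f6ne"
m2 = re.search(lookahead, reordered)                            # still matches
sequential = re.search("\u00F6.*\u00E4.*\u00E7", reordered)     # None: order matters here

print(whole, groups, m2 is not None, sequential is None)
```

If the order ö, then ä, then ç matters, the original sequential pattern with capture groups added is probably the closer fit.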

Upvotes: 0

Ben

Reputation: 35643

You have misunderstood the returned data.

PCRE returns the starting and ending positions of the match. It has matched only once in each case, but the match includes the whole string matched, including the parts matched by "boring" things like .*.

So for your input string it has matched these parts:

DAS tausendschöne Jungfräulein tausendschçne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmm..

Or equivalently it has matched these bytes:

0         1         2         3         4  4
01234567890123456789012345678901234567890123456789
DAS tausendschÃ¶ne JungfrÃ¤ulein tausendschÃ§ne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
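The difference between the two pictures can be reproduced with a short sketch. This uses Python's `re` module as a stand-in, not PCRE itself, so the exact numbers PCRE reports may differ with build and options. Matching on the text gives character offsets; matching on its UTF-8 encoding gives byte offsets. Either way there is a single match that runs from the ö through the ç.

```python
import re

# 'DAS tausendschöne Jungfräulein tausendschçne', written with escapes
subject = "DAS tausendsch\u00f6ne Jungfr\u00e4ulein tausendsch\u00e7ne"
pattern = "\u00F6.*\u00E4.*\u00E7"

# Character offsets: one match, from 'ö' to just past 'ç'
m_chars = re.search(pattern, subject)
char_span = m_chars.span()

# Byte offsets: the same match on the UTF-8 bytes of subject and pattern.
# Each of ö, ä, ç occupies two bytes, so the span is wider.
m_bytes = re.search(pattern.encode("utf-8"), subject.encode("utf-8"))
byte_span = m_bytes.span()

print(char_span)  # (14, 42)
print(byte_span)  # (14, 45)
```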

It is behaving correctly. From http://www.pcre.org/pcre.txt :

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts.

The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on.
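To get the position of each code point rather than just the overall span, one option is to wrap each one in a capturing group and read the later pairs. The sketch below again uses Python's `re` on UTF-8 bytes as an analogue of the ovector layout; it does not call PCRE, and the list name `pairs` is only illustrative. Entry 0 plays the role of ovector[0] and ovector[1], and entries 1 to 3 correspond to the capturing subpatterns.

```python
import re

# UTF-8 bytes of 'DAS tausendschöne Jungfräulein tausendschçne'
subject = "DAS tausendsch\u00f6ne Jungfr\u00e4ulein tausendsch\u00e7ne".encode("utf-8")

# Same sequential pattern, with a capturing group around each code point
pattern = "(\u00F6).*(\u00E4).*(\u00E7)".encode("utf-8")

m = re.search(pattern, subject)

# (start, end) byte offsets: index 0 is the whole match, 1..3 are the groups.
# As in the ovector, each end offset points one past the last byte matched.
pairs = [m.span(i) for i in range(m.re.groups + 1)]

print(pairs)  # [(14, 45), (14, 16), (25, 27), (43, 45)]
```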

Upvotes: 1
