PHP Regex error on right-double-quotation-mark

Question

I've got a function that finds and extracts "strips" of 3 words from a longer string into an array. Punctuation marks next to words should be included in the word (e.g. a word followed by a comma should be treated as a single word).

It works fine except on one UTF-8 character - a Double Right Quotation Mark (U+201D - ”).

Am I doing something wrong in my regex or is this a PHP bug?

The regex is:

$myarray = preg_match_all(
    "/(\S)*(\s)(\S)*(\s)(\S)*(\s)/",
    $incomingstring,
    $output, 
    PREG_PATTERN_ORDER);

Strangely, the regex has no problems with Double Left Quotation Marks (U+201C - “) or with the other Unicode characters I tried.

Niet the Dark Absol · Accepted Answer

When the string is treated as a sequence of single bytes, ” is seen as three bytes: 0xE2, 0x80, 0x9D.

Similarly, “ becomes 0xE2, 0x80, 0x9C.

The two sequences differ only in the last byte: 0x9C in one case, 0x9D in the other. In Windows-1252 encoding (which is the common default, often mislabelled as ISO-8859-1), 0x9C is œ, but 0x9D is not defined. This leads to unpredictable behaviour in how \S and \s classify that byte, which causes your regex to break.
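The byte sequences quoted above can be checked directly. This is a minimal sketch, assuming PHP 7.0 or later for the "\u{...}" escape syntax; the variable names $right and $left are only for illustration.

```php
<?php
// Build the two quotation marks from their code points (PHP 7.0+ syntax),
// so the result does not depend on how this source file is encoded.
$right = "\u{201D}"; // right double quotation mark
$left  = "\u{201C}"; // left double quotation mark

// bin2hex() shows the raw UTF-8 bytes of each string.
echo bin2hex($right), "\n"; // e2809d
echo bin2hex($left), "\n";  // e2809c

// strlen() counts bytes, not characters: each mark is three bytes long.
echo strlen($right), "\n";  // 3
```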

The solution, as hindmost pointed out in a comment, is to use the u modifier to tell your regex to work in UTF-8 instead of single bytes.
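As a rough sketch of that fix, here is the pattern from the question with the u modifier appended. The sample input string is made up for illustration, and the "\u{...}" escapes assume PHP 7.0 or later. Note that preg_match_all() returns the number of matches (or false on failure); the matched strips themselves end up in $output.

```php
<?php
// Hypothetical sample input containing both kinds of double quotation mark.
$incomingstring = "He said \u{201C}hello there\u{201D} and then left quickly today ";

// Same pattern as in the question, with the u modifier added so that
// the subject and pattern are handled as UTF-8 rather than single bytes.
$count = preg_match_all(
    '/(\S)*(\s)(\S)*(\s)(\S)*(\s)/u',
    $incomingstring,
    $output,
    PREG_PATTERN_ORDER
);

if ($count === false) {
    // With the u modifier, invalid UTF-8 input makes the call fail.
    echo "Regex error code: ", preg_last_error(), "\n";
} else {
    // $output[0] holds the full matches, i.e. the three-word strips.
    echo $count, "\n"; // 3
    foreach ($output[0] as $strip) {
        echo "[", $strip, "]\n";
    }
}
```

With PREG_PATTERN_ORDER, $output[1] onwards hold the individual capture groups, so $output[0] is the array to read the strips from.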
