Reputation: 371
I've got a function that finds and extracts "strips" of 3 words from a longer string into an array. Punctuation marks next to words should be included in the word (e.g. a word followed by a comma should be treated as a single word).
It works fine except on one UTF-8 character - a Double Right Quotation Mark (U+201D - ”).
Am I doing something wrong in my regex or is this a PHP bug?
The regex is:
$myarray = preg_match_all(
"/(\S)*(\s)(\S)*(\s)(\S)*(\s)/",
$incomingstring,
$output,
PREG_PATTERN_ORDER);
Strangely, the regex has no problems with Double Left Quotation Marks (U+201C - “) or the other Unicode characters I tried.
Upvotes: 2
Views: 423
Reputation: 324760
When the string is treated as a sequence of single bytes, ” is seen as the three bytes 0xE2, 0x80, 0x9D.
Similarly, “ becomes 0xE2, 0x80, 0x9C.
The two sequences differ only in the last byte: 0x9C for “ and 0x9D for ”. In Windows-1252 encoding (which is the common default, often mislabelled as ISO-8859-1), 0x9C is œ, but 0x9D is not defined. This leads to unpredictable behaviour from \S and \s, causing your regex to break.
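If you want to check the byte sequences yourself, a quick sketch like the following should do it. The variable names are just for illustration, and the "\u{...}" escape syntax assumes PHP 7.0 or later:

```php
<?php
// Illustrative variable names; "\u{...}" escapes need PHP 7.0+.
$right = "\u{201D}"; // Double Right Quotation Mark
$left  = "\u{201C}"; // Double Left Quotation Mark

// bin2hex() shows the raw UTF-8 bytes of each character.
echo bin2hex($right), "\n"; // e2809d
echo bin2hex($left), "\n";  // e2809c

// strlen() counts bytes, not characters, so each mark has length 3.
echo strlen($right), "\n";  // 3
```

Only the final byte differs, which is why one quotation mark trips up a byte-oriented regex and the other does not.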
The solution, as hindmost pointed out in a comment, is to use the u modifier to tell your regex to work in UTF-8 instead of single bytes.
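As a rough sketch of what that looks like, here is the call from the question with the u modifier added. The sample input string is made up for illustration. I've also dropped the capture groups, on the assumption that only the full matches in $output[0] are needed, and renamed the return variable, because preg_match_all() returns a match count rather than an array:

```php
<?php
// Made-up sample input containing both quotation marks (PHP 7.0+ escapes).
$incomingstring = "He said \u{201C}hello there\u{201D} and then left quietly ";

// The trailing "u" makes PCRE treat the pattern and subject as UTF-8.
$count = preg_match_all(
    '/\S*\s\S*\s\S*\s/u',
    $incomingstring,
    $output,
    PREG_PATTERN_ORDER
);

if ($count === false) {
    // With /u, a subject that is not valid UTF-8 makes the call fail.
    echo "preg error code: ", preg_last_error(), "\n";
} else {
    // $output[0] holds the full three-word strips.
    print_r($output[0]);
}
```

One thing to watch for: with the u modifier, input that is not valid UTF-8 causes the match to fail outright, so it is worth checking the return value for false.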
Upvotes: 2