Reputation: 10351
I have this simple code to count punctuation in a string, i.e. "there are 2 commas, 3 semicolons..." etc. But when it sees an em-dash (—), it doesn't work. Note that it is not a hyphen (-); I don't care about those.
Is there something special about the em-dash that makes it behave oddly in a PHP string and/or as an array key? Maybe a Unicode problem?
$punc_counts = array(
    "," => 0,
    ";" => 0,
    "—" => 0, //exists, really!
    "'" => 0,
    "\"" => 0,
    "(" => 0,
    ")" => 0,
);
// $str is a long string of text
//remove all non-punctuation chars from $str (works correctly, keeping em-dashes)
$puncs = "";
foreach($punc_counts as $key => $value)
    $puncs .= $key;
$str = preg_replace("/[^{$puncs}]/", "", $str);
//$str now equals something like:
//$str == ",;'—\"—()();;,";
foreach(str_split($str) as $char)
{
    //if it's a punctuation char we care about, count it
    if(isset($punc_counts[$char]))
        $punc_counts[$char]++;
    else
        print($char);
}
print("<br/>");
print_r($punc_counts);
print("<br/>");
The code above prints:
——
Array ( [,] => 2 [;] => 3 [—] => 0 ['] => 1 ["] => 1 [(] => 2 [)] => 2 )
Upvotes: 1
Views: 203
Reputation: 23346
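str_split() is not multibyte-aware: it splits the string into single bytes. Assuming your text is UTF-8, an em-dash is three bytes (E2 80 94), so none of the pieces ever matches the "—" array key. There is a useful comment on the PHP doc page for str_split that suggests the following: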
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
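To show how a character-aware split fits into the counting loop from the question, here is a minimal sketch. It assumes the source file and the input are both UTF-8. The function name count_punctuation is hypothetical and does not come from the question or the PHP docs. It uses the same preg_split("//u", ...) call as above, so each piece is a whole character rather than a single byte.

```php
<?php
// Hypothetical helper (not from the original post): counts the given
// punctuation characters in a UTF-8 string, one character at a time.
function count_punctuation(string $str, array $chars): array
{
    $counts = array_fill_keys($chars, 0);

    // The "u" modifier makes PCRE treat the subject as UTF-8, so every
    // element of $pieces is a complete character, not a lone byte.
    $pieces = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($pieces === false) {
        // preg_split() returns false on invalid UTF-8; count nothing then.
        return $counts;
    }

    foreach ($pieces as $char) {
        if (isset($counts[$char])) {
            $counts[$char]++;
        }
    }
    return $counts;
}

// Sample string taken from the question's comment.
$counts = count_punctuation(",;'—\"—()();;,", [",", ";", "—", "'", "\"", "(", ")"]);
print_r($counts);
```

On PHP 7.4 or later, mb_str_split($str, 1, "UTF-8") should be able to replace the preg_split() call. If you keep the preg_replace() filter from the question, it is probably worth adding the u modifier there as well, so the character class matches whole characters instead of individual bytes.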
Upvotes: 1