Reputation: 13
When I run this code
$string='<p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p>';
echo substr(strip_tags(trim(html_entity_decode($string, ENT_COMPAT, 'UTF-8'))), 0, 14);
I get this result:
Şelamiİnnş�
What is my mistake?
Upvotes: 1
Views: 1445
Reputation: 98005
Firstly, always break your problem down into smaller parts to see where it's going wrong:
$string=html_entity_decode($string, ENT_COMPAT, 'UTF-8');
echo $string, "\n";
$string = trim($string);
echo $string, "\n";
$string = strip_tags($string);
echo $string, "\n";
$string = substr($string, 0, 14);
echo $string, "\n";
If you run that, you'll see that the problem has nothing to do with strip_tags; it has to do with substr.
The reason is simple: strings in PHP are just a series of bytes, and functions like substr don't count "characters" in any meaningful way. So substr($string, 0, 14) simply takes the first 14 bytes of the string. In this case the 14th byte is the first half of the second "ş", which UTF-8 encodes as two bytes, so the cut leaves an incomplete sequence that is displayed as "�".
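To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8 and that the mbstring extension is loaded; the variable names $sample and $cut are only for illustration.

```php
<?php
// Illustrative sample: the first 11 characters of the text from the question.
// "Ş", "İ" and "ş" each occupy two bytes in UTF-8.
$sample = 'Şelamiİnnşş';

// strlen() counts bytes; mb_strlen() counts code points in the given encoding.
echo strlen($sample), "\n";             // 15
echo mb_strlen($sample, 'UTF-8'), "\n"; // 11

// Cutting after byte 14 keeps only the first byte of the last "ş",
// so the result is no longer valid UTF-8.
$cut = substr($sample, 0, 14);
var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(false)
```

Checking the result with mb_check_encoding is one way to confirm that a byte-based cut has split a multi-byte sequence.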
The most common solution to this is to use mb_substr (part of the "mbstring" PHP extension), which counts "characters" according to some encoding:
$string = mb_substr($string, 0, 14, 'UTF-8');
echo $string, "\n";
// Şelamiİnnşşasd
Note that this truncates to 14 Unicode code points, so it can still do odd things like chop an accent off a letter if the accent has been encoded as a "combining diacritic".
An alternative in some cases would be to use grapheme_substr (part of the "intl" extension), which splits on "graphemes", which are intended to be roughly what people would think of as a "character" or "letter". In this case, it gives the same result:
$string = grapheme_substr($string, 0, 14); // no encoding argument: grapheme_substr always expects UTF-8
echo $string, "\n";
// Şelamiİnnşşasd
But in other cases, it might not:
$string = "noe\u{0308}l"; // "noël" with the accent stored as a combining diaeresis (U+0308); needs PHP 7+
echo mb_substr($string, 0, 3, 'UTF-8'), "\n"; // noe
echo grapheme_substr($string, 0, 3), "\n"; // noë
Upvotes: 1
Reputation: 1619
You should use the multi-byte version of substr(), mb_substr.
Try:
<?php
$string = '<p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p><p>Şelamiİnnşşasdüğ213,123wqeq.weqw.rqasd</p>';
// Pass the encoding explicitly so the result does not depend on the configured default.
echo mb_substr(strip_tags(trim(html_entity_decode($string, ENT_COMPAT, 'UTF-8'))), 0, 14, 'UTF-8');
?>
Upvotes: 0