Reputation: 2367
I'm using this function to clean strings for elastic search:
function cleanString($string) {
    $string = mb_convert_encoding($string, "UTF-8");
    $string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
    $string = strip_tags($string);
    $string = filter_var($string, FILTER_SANITIZE_STRING);
    $string = str_ireplace(array("\t", "\n", "\r", " ", " ­", ":"), ' ', $string);
    $string = str_ireplace(array("­", "«", "»", "£"), '', $string);
    return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of stuff, but the problem I am facing has to do with the trim call at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim away from the string: « and ». This caused problems with another special character: when I pass the word België into the function, the ë gets corrupted and Elastic throws an error.
How can I trim away « and » and still preserve ë?
Upvotes: 1
Views: 366
Reputation: 522635
trim is not encoding-aware and just looks at individual bytes. If you tell it to trim '«»', and that's encoded in UTF-8, it will look for the bytes C2 AB C2 BB (where the repeated C2 is redundant, so AB BB C2 are the actual search terms). "ë" in UTF-8 is C3 AB, so half of it gets removed and the character is thereby broken.
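A quick way to see this for yourself, assuming the source file is saved as UTF-8, is to dump the raw bytes with bin2hex:

```php
<?php
// Inspect the UTF-8 bytes (file must be saved as UTF-8):
echo bin2hex("«»"), "\n"; // c2abc2bb -> trim's byte set is {C2, AB, BB}
echo bin2hex("ë"), "\n";  // c3ab     -> the AB byte is shared with "«"
```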
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
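A minimal sketch of the difference, using a hypothetical input and assuming UTF-8 throughout:

```php
<?php
$word = "België«";

// Byte-wise trim(): "ë" is C3 AB, and AB is also the second byte of "«",
// so trim() strips the AB byte of "ë" and leaves an invalid lone C3 byte.
$broken = trim($word, "«»");
var_dump(mb_check_encoding($broken, "UTF-8")); // bool(false)

// The /u modifier makes the pattern match whole UTF-8 characters, so only
// complete "«"/"»" characters at either end are removed and "ë" survives.
$clean = preg_replace('/^[«»]+|[«»]+$/u', '', $word);
var_dump($clean); // string(7) "België"
```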
Upvotes: 4