Reputation: 6008
Is there any way in PHP of detecting the following character �
?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if �
is present in a string. How do I do so with strpos
?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Upvotes: 9
Views: 8888
Reputation: 56351
I use the CUSTOM method (using str_replace
) to sanitize undefined characters:
$input='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3
Upvotes: 0
Reputation: 449395
Converting a UTF-8 string into UTF-8 using iconv()
using the //IGNORE
parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, they contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
in theory, you could drop //IGNORE
and simply test for a failed (empty) iconv
operation, but there might be other reasons for a iconv to fail than just invalid characters... I don't know. I would use the comparison method.
Upvotes: 19
Reputation: 7277
Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}
Upvotes: 4
Reputation: 15221
As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.
Upvotes: 1