John
John

Reputation: 13757

Will comparing the binary data of a string with an unknown character encoding validate what its encoding is?

I need to automatically determine the character encoding of strings from email content and headers. For the most part this isn't an issue however there is an occasional email with content and/or a header that has an oddball character such as an en dash. Now I received an answer that technically seems to work if I statically test it on a specific header for a specific email however that blatantly ignores the fact that importing email needs to be a completely automated process in which case I am utterly unable to automatically determine the string's character encoding.

I've started with the basics such as detecting common trouble characters that seem to guarantee a character encoding issue will occur. However strpos('en dash: –', '–') works fine while intentionally / manually testing though it fails outright when added directly to the automated process. I'm going to guess that the issue there is that the string parameters have a UTF-8 encoding while the automated process is testing a string that isn't yet UTF-8 and thus internally the same character isn't using the same subset of code (via character encoding).

So my second attempt was mb_detect_encoding's second parameter can be an array. So I tried the following:

$encodings = array('UTF-8','UCS-4','UCS-4BE','UCS-4LE','UCS-2','UCS-2BE','UCS-2LE','UTF-32','UTF-32BE','UTF-32LE','UTF-16','UTF-16BE','UTF-16LE','UTF-7','UTF7-IMAP','ASCII','EUC-JP','SJIS','eucJP-win','SJIS-win','ISO-2022-JP','ISO-2022-JP-MS','CP932','CP51932','SJIS-mac','SJIS-Mobile#DOCOMO','SJIS-Mobile#KDDI','SJIS-Mobile#SOFTBANK','UTF-8-Mobile#DOCOMO','UTF-8-Mobile#KDDI-A','UTF-8-Mobile#KDDI-B','UTF-8-Mobile#SOFTBANK','ISO-2022-JP-MOBILE#KDDI','JIS','JIS-ms','CP50220','CP50220raw','CP50221','CP50222','ISO-8859-1','ISO-8859-2','ISO-8859-3','ISO-8859-4','ISO-8859-5','ISO-8859-6','ISO-8859-7','ISO-8859-8','ISO-8859-9','ISO-8859-10','ISO-8859-13','ISO-8859-14','ISO-8859-15','ISO-8859-16','byte2be','byte2le','byte4be','byte4le','BASE64','HTML-ENTITIES','7bit','8bit','EUC-CN','CP936','GB18030','HZ','EUC-TW','CP950','BIG-5','EUC-KR','UHC','ISO-2022-KR','Windows-1251','Windows-1252','CP866','KOI8-R','KOI8-U','ArmSCII-8');
$encoding = mb_detect_encoding($s, $encodings, true);
$compare = mb_convert_encoding($s, 'UTF-8', $encoding);

foreach ($encodings as $k1)
{
 if (mb_convert_encoding($s, 'UTF-8', $k1) === $s) {$encoding = $k1; break;}
}

Unfortunately that seemed to result in the same failure based on what I presume was the same underlying issue.

So my third idea I'm looking for some more experienced validation. I could convert the string down in its binary form (ones and zeroes, not binary data). Then I could try converting the string and then converting that second string to binary to compare the two binary versions; if they === match then I might have determined the correct character encoding?

Now I can easily try this with this answer from an unrelated thread however I'm not certain if this is a valid idea or not. This is all intended to answer my question:

How can I determine the actual character encoding of a string in order to convert it to UTF-8 with fully automated validation without corrupting data?

By validation I'm talking about stuff like comparing the binary data though again, I'm not certain if that is a valid approach or not. I do know that I absolutely hate en dashes though.

Upvotes: 1

Views: 198

Answers (1)

AmigoJack
AmigoJack

Reputation: 6174

The answer won't change: it's impossible. You have to rely on external information which encoding is used on text.

Guessing an encoding can horribly go wrong:

  • Based on the order in which you test against it can either turn out as i.e. ASCII or UTF-8 or Windows-1252, just because so far it fits in. Your list is questionable, because it may match Base64 which is not even a text encoding.
  • If the source is not properly encoded itself then guessing its encoding will most likely exclude the correct one. And guess a wrong one. Which makes things worse.
  • Many encodings share the same area: the source can either fit i.e. Windows-1252 or Windows-1251 and even detecting the lexical sense of the text cannot guarantee which of both is correct.

Also: ones and zeroes are binary. PHP strings are only byte arrays, so they're binary to begin with. How they're interpreted relies on you: if your code is $text= "グリーン"; then it's up to which encoding your PHP text file has and how your PHP defaults are set. There is no "internal ... character", only bytes. Which is also the reason why there are functions which operate on bytes (i.e. strlen()) and on a specific text encoding (i.e. mb_strlen()).

If you hate single characters or not: they can be easily used as what they are: characters in texts. And has its own valid meaning in contrast to and and -; don't replace it by personal opinion, because that could corrupt a context's meaning. It's like ignoring the fact that A and Α and A are all different characters. You might want to look up the difference between homoglyphs and synoglyphs - the latter is your current perspective.

You may ask "And in which encoding does PHP interpret the scripts?" Luckily ASCII is for most encodings the most common denominator, so interpreting the first bytes of a file as such to search for <?php (all these are ASCII characters, so for PHP code itself it doesn't matter if it is effectively UTF-8 or ISO-8859-1 or Shift-JIS) will only fail when the document is encoded in i.e. UTF-16 - in that case you must set your PHP defaults to that encoding. Which again proves: text encodings must be told outside of the text.

Upvotes: 1

Related Questions