Reputation: 9838
I have read some other threads on this subject but I cannot understand what I am doing wrong.
I have a function
public function reEncode($item)
{
    if (! mb_detect_encoding($item, 'utf-8', true)) {
        $item = utf8_encode($item);
    }
    return $item;
}
I am writing a test for this. I want to test a string that is not UTF-8
to see if this statement is hit. I am having trouble creating the test string.
$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));
$sanitized = $this->reEncode($contents);
var_dump(mb_detect_encoding($sanitized));
Initially I used file_get_contents on a file I saved in Sublime with various encodings (Cyrillic (KOI8-R), HEX and DOS (CP 437)), as it has been stated that file_get_contents() ignores the file encoding. This seems to be true, as the characters returned are a jumbled mess.
That said, every time I use mb_detect_encoding() on these variables, I always get ASCII or UTF-8. The statement is never triggered because ASCII is a subset of UTF-8.
So I have tried mb_convert_encoding() and iconv() to convert a basic string to UTF-16, UTF-32, base64, hex, etc., but every time mb_detect_encoding() returns ASCII or UTF-8.
In my tests I want to assert the encoding type before and after this function is called.
$sanitized = $this->reEncode($contents);
$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));
I cannot understand what basic mistake I am making to constantly get ASCII or UTF-8 returned from mb_detect_encoding().
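For context, a minimal sketch of why only ASCII or UTF-8 ever comes back. It assumes the mbstring extension is loaded and that detect_order is left at its default, which is typically just ASCII and UTF-8, so a call without an explicit candidate list can only pick from those. The byte string below is the KOI8-R encoding of 'Котёнок' written as raw bytes, so the result does not depend on how the source file is saved.

```php
<?php
// Candidate list used when mb_detect_encoding() is called without one.
// Assumption: php.ini leaves mbstring.detect_order at its default.
var_dump(mb_detect_order());

// 'Котёнок' encoded as KOI8-R, given as raw bytes.
$koi8 = "\xEB\xCF\xD4\xA3\xCE\xCF\xCB";

// base64 and hex are text representations built only from ASCII characters,
// so a detector will always see them as ASCII (and therefore valid UTF-8).
$asBase64 = base64_encode($koi8);
$asHex    = bin2hex($koi8);
var_dump(mb_check_encoding($asBase64, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($asHex, 'ASCII'));    // bool(true)

// With UTF-8 as the only candidate and strict mode on, the raw KOI8-R
// bytes are rejected because they are not a valid UTF-8 sequence.
$strictResult = mb_detect_encoding($koi8, 'UTF-8', true);
var_dump($strictResult); // bool(false)
```

Single-byte encodings such as KOI8-R accept almost any byte sequence, so detection can only ever be a guess; checking validity against one specific encoding is more reliable than asking which encoding a string "is".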
Upvotes: 2
Views: 525
Reputation: 9838
Ok, so it turns out you must pass the expected encoding and use strict mode when checking, or the mb_detect_encoding() function is next to useless.
$item = mb_convert_encoding('Котёнок', 'KOI8-R');
$sanitized = $this->reEncode($item);
$this->assertEquals('KOI8-R', mb_detect_encoding($item, 'KOI8-R', true));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized, 'UTF-8', true));
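One caveat worth sketching: utf8_encode() treats its input as ISO-8859-1 (and is deprecated as of PHP 8.2), so the result is valid UTF-8 but not the original Cyrillic text. If the source encoding is known, a variant along these lines may be closer to what is wanted. The helper name reEncodeFrom and its $sourceEncoding parameter are hypothetical and not part of the original code; the input is again the KOI8-R bytes for 'Котёнок'.

```php
<?php
// Hypothetical helper: unlike the original reEncode(), the caller must
// state which encoding the input is in (an assumption of this sketch).
function reEncodeFrom(string $item, string $sourceEncoding): string
{
    // Leave the string alone if it is already valid UTF-8.
    if (mb_check_encoding($item, 'UTF-8')) {
        return $item;
    }
    // Otherwise convert from the declared source encoding.
    return mb_convert_encoding($item, 'UTF-8', $sourceEncoding);
}

// 'Котёнок' encoded as KOI8-R, given as raw bytes.
$koi8 = "\xEB\xCF\xD4\xA3\xCE\xCF\xCB";
$utf8 = reEncodeFrom($koi8, 'KOI8-R');

var_dump(mb_check_encoding($koi8, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
echo $utf8, PHP_EOL;
```

In a test, asserting mb_check_encoding(..., 'UTF-8') before and after the call expresses the same intent as the assertions above without relying on detection.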
Upvotes: 1