myol

Reputation: 9838

Testing non UTF-8 string

I have read some other threads on this subject but I cannot understand what I am doing wrong.

I have a function

public function reEncode($item)
{
    if (! mb_detect_encoding($item, 'utf-8', true)) {
        $item = utf8_encode($item);
    }

    return $item;
}

I am writing a test for this. I want to test a string that is not UTF-8 to see if this statement is hit. I am having trouble creating the test string.

$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));

$sanitized = $this->reEncode($contents);
var_dump(mb_detect_encoding($sanitized));

Initially I used file_get_contents() on files I saved in Sublime Text with various encodings: Cyrillic (KOI8-R), Hexadecimal and DOS (CP 437), since it has been stated that file_get_contents() ignores the file encoding. This seems to be true, as the characters returned are a jumbled mess.

That said, every time I use mb_detect_encoding() on these variables, I always get ASCII or UTF-8. The statement is never triggered because ASCII is a subset of UTF-8.

So I have tried mb_convert_encoding() and iconv() to convert a basic string to UTF-16, UTF-32, base64, hex and so on, but every time mb_detect_encoding() returns ASCII or UTF-8.

In my tests I want to assert the encoding type before and after this function is called.

$sanitized = $this->reEncode($contents);

$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));

I cannot understand what basic mistake I am making that causes mb_detect_encoding() to constantly return ASCII or UTF-8.

Upvotes: 2

Views: 525

Answers (1)

myol

Reputation: 9838

OK, so it turns out you must pass an explicit encoding list and set the strict parameter to true; otherwise mb_detect_encoding() is next to useless for this kind of check.

// Source encoding given explicitly; this assumes the test file itself is saved as UTF-8.
$item = mb_convert_encoding('Котёнок', 'KOI8-R', 'UTF-8');

$sanitized = $this->reEncode($item);

$this->assertEquals('KOI8-R', mb_detect_encoding($item, 'KOI8-R', true));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized, 'UTF-8', true));
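
For what it's worth, the assertions above really ask "are these bytes valid in encoding X?" rather than "which encoding is this?", so mb_check_encoding() may express the intent more directly. Below is a minimal sketch of that idea, assuming the mbstring extension is loaded and the script is saved as UTF-8; the variable names are only illustrative and not part of the original test. It also converts from the known source encoding with mb_convert_encoding(), because utf8_encode() always treats its input as ISO-8859-1 (and is deprecated as of PHP 8.2), so for KOI8-R input it would produce valid UTF-8 but not the original text.

```php
<?php
// Build a KOI8-R byte string from a UTF-8 literal
// (assumption: this source file is saved as UTF-8).
$utf8  = 'Котёнок';
$koi8r = mb_convert_encoding($utf8, 'KOI8-R', 'UTF-8');

// Validation rather than detection: are these bytes well-formed UTF-8?
// The KOI8-R bytes for this word do not form valid UTF-8 sequences.
$isUtf8Before = mb_check_encoding($koi8r, 'UTF-8');

// Convert from the *known* source encoding instead of guessing it.
$roundTrip   = mb_convert_encoding($koi8r, 'UTF-8', 'KOI8-R');
$isUtf8After = mb_check_encoding($roundTrip, 'UTF-8');

var_dump($isUtf8Before, $isUtf8After, $roundTrip === $utf8);

// The detection order in effect when no encoding list is passed to
// mb_detect_encoding(); with a short list such as ASCII and UTF-8,
// no other name can ever be returned.
var_dump(mb_detect_order());
```

In a PHPUnit test this would translate to assertFalse(mb_check_encoding($item, 'UTF-8')) before the call and assertTrue(mb_check_encoding($sanitized, 'UTF-8')) after it, which checks the branch in reEncode() without relying on encoding guesswork.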

Upvotes: 1
