What changes my UTF-8 string to ASCII?

Question

I have the following code:

$string = $this->getTextFromHTML($html);

echo mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');

$stringArray = mb_split('\W+', $string);
$cleaned = array();
foreach($stringArray as $v) {
    $string = trim($v);
    if(!empty($string))
        array_push($cleaned, $string);
}

echo mb_detect_encoding($stringArray[752], 'ASCII,UTF-8,ISO-8859-1');

The above returns:

// UTF-8
// ASCII

What part of my code is turning my string into ASCII? Or am I detecting the encoding incorrectly?

deceze · Accepted Answer

Strings in PHP have no associated encoding; they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has; it merely tries to detect it. It takes the candidate encodings you pass (your second argument), checks them in order, and reports the first one for which the bytes are valid.
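
A minimal sketch of that order dependence, assuming the mbstring extension is loaded; the sample text and variable names are made up for illustration. The same pure-ASCII bytes are reported differently depending on which candidate is listed first:

```php
<?php
// Hypothetical sample: every byte is below 0x80, so it is valid ASCII and valid UTF-8.
$sample = 'plain text';

// Same bytes, different candidate order; strict mode (third argument) is on.
$asciiFirst = mb_detect_encoding($sample, ['ASCII', 'UTF-8'], true);
$utf8First  = mb_detect_encoding($sample, ['UTF-8', 'ASCII'], true);

echo $asciiFirst, PHP_EOL; // ASCII
echo $utf8First, PHP_EOL;  // UTF-8
```

The bytes are identical in both calls; only the guess changes.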

Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you later test a substring of the original ($stringArray[752]), that substring probably contains only characters that are valid in ASCII, and since ASCII is the first encoding tested, that's the guessed result. Any ASCII string is also valid UTF-8, so nothing is wrong and no "conversion" happened.
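
The effect can be reproduced with a short sketch, again assuming mbstring is available. The input text is hypothetical, and ISO-8859-1 from the question's candidate list is left out to keep the result unambiguous. mb_check_encoding is probably the better tool if the real question is "is this piece still valid UTF-8?":

```php
<?php
// Hypothetical input: UTF-8 text containing two non-ASCII characters (ï and é).
$text = "na\u{00EF}ve caf\u{00E9} menu";

$candidates = ['ASCII', 'UTF-8'];

// Whole string: it contains bytes above 0x7F, so ASCII is ruled out.
$wholeGuess = mb_detect_encoding($text, $candidates, true);
echo $wholeGuess, PHP_EOL; // UTF-8

// Split on spaces; the last piece, "menu", is pure ASCII.
$words = explode(' ', $text);
$last = $words[2];
$pieceGuess = mb_detect_encoding($last, $candidates, true);
echo $pieceGuess, PHP_EOL; // ASCII

// The bytes were never converted: the piece is still valid UTF-8.
$stillUtf8 = mb_check_encoding($last, 'UTF-8');
var_dump($stillUtf8); // bool(true)
```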
