Reputation: 57916
I have the following code:
$string = $this->getTextFromHTML($html);
echo mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');
$stringArray = mb_split('\W+', $string);
$cleaned = array();
foreach ($stringArray as $v) {
    $string = trim($v);
    if (!empty($string))
        array_push($cleaned, $string);
}
echo mb_detect_encoding($stringArray[752], 'ASCII,UTF-8,ISO-8859-1');
The above returns:
// UTF-8
// ASCII
What part of my code is turning my string into ASCII? Or am I detecting the encoding incorrectly?
Upvotes: 3
Views: 394
Reputation: 20889
As @Phylogenesis mentioned in the comments, ASCII characters (bytes 0x00 through 0x7F) are also valid UTF-8. Unless your data contains a byte order mark or other non-ASCII bytes, the text is both valid ASCII and valid UTF-8. You listed ASCII before UTF-8, so ASCII is returned.
For example: https://ideone.com/DupS4A
<?php
$str = "apple";
// Returns ASCII
var_dump(mb_detect_encoding($str, "ASCII, UTF-8"));
// 0xEFBBBF is the byte order mark in UTF-8
$str_with_bom = chr(0xEF) . chr(0xBB) . chr(0xBF) . "apple";
// Returns UTF-8
var_dump(mb_detect_encoding($str_with_bom, "ASCII, UTF-8"));
Upvotes: 2
Reputation: 522042
Strings have no actual associated encoding, they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has, it merely tries to detect it. That means it takes a few guesses (your second argument) and tells you the first that is valid.
Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you're later testing a substring of the original, that substring probably contains only characters which are valid in ASCII, and since ASCII is the first encoding that's tested, that's the guessed result. Any ASCII string is also valid UTF-8, so there's no actual problem or "conversion" which happened.
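To make that concrete, here is a minimal sketch of the same effect on a made-up two-word string. The sample text and variable names are my own assumptions, not taken from the question, and it assumes the mbstring extension is loaded. It passes true as the third (strict) argument so that only encodings in which the bytes are actually valid are considered.

```php
<?php
// Hypothetical sample: "café apple" as UTF-8 bytes (é is 0xC3 0xA9).
$text = "caf\xC3\xA9 apple";
$candidates = 'ASCII,UTF-8';

// The whole string contains bytes above 0x7F, so ASCII is ruled out.
$whole = mb_detect_encoding($text, $candidates, true);

// Split into words; the second word consists of ASCII bytes only.
$parts = explode(' ', $text);
$first = mb_detect_encoding($parts[0], $candidates, true);
$second = mb_detect_encoding($parts[1], $candidates, true);

// mb_check_encoding asks a narrower question:
// "is this byte sequence valid in the given encoding?"
$secondIsValidUtf8 = mb_check_encoding($parts[1], 'UTF-8');

var_dump($whole, $first, $second, $secondIsValidUtf8);
```

The whole string and the first word are reported as UTF-8, while the second word is reported as ASCII, even though nothing converted it. If all you need to know is whether a piece is acceptable as UTF-8, mb_check_encoding($piece, 'UTF-8') answers that directly and does not depend on the order of a candidate list.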
Upvotes: 3