Reputation: 95326
A problem with various character encodings is that a file's encoding is not always clearly marked. There are inconsistent conventions for marking some of them with "byte-order marks" (BOMs). But in essence, you have to be told what the file encoding is in order to read it accurately.
We build programming tools that read source files, and this gives us grief. We have means to specify defaults, sniff for BOMs, etc., and we do pretty well with conventions and defaults. But one place where we (and, I assume, everybody else) get hung up is UTF-8 files that are not BOM-marked.
Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine if it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic (although a discussion of heuristics in general is fine). How can it be "right"? (Consider an ISO 8859-x encoded string interpreted this way.)
EDIT: This paper on detecting character encodings/sets is pretty interesting: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
EDIT December 2012: We ended up scanning the entire file to see if it contains any violations of UTF-8 sequences... and if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)
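For reference, a minimal sketch of that whole-file check in C might look like the following; the function name is_valid_utf8 and its exact structure are illustrative, not our actual implementation. Unlike the shorter per-sequence test in the answer further down, this version also rejects overlong forms, UTF-16 surrogates, and code points above U+10FFFF:

    #include <stddef.h>

    /* Return 1 if buf[0..len) is entirely well-formed UTF-8, else 0.
       Rejects stray continuation bytes, truncated sequences, overlong
       forms, UTF-16 surrogates, and code points above U+10FFFF. */
    int is_valid_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = buf[i];
            size_t n;         /* number of continuation bytes expected */
            unsigned long cp; /* decoded code point */

            if (b < 0x80) { i++; continue; }                   /* ASCII */
            else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
            else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
            else return 0;    /* continuation byte or 0xF8..0xFF lead */

            if (len - i < n + 1)
                return 0;     /* sequence truncated at end of buffer */
            for (size_t k = 1; k <= n; k++) {
                if ((buf[i + k] & 0xC0) != 0x80)
                    return 0; /* missing continuation byte */
                cp = (cp << 6) | (buf[i + k] & 0x3F);
            }
            if ((n == 1 && cp < 0x80) ||          /* overlong 2-byte form */
                (n == 2 && cp < 0x800) ||         /* overlong 3-byte form */
                (n == 3 && cp < 0x10000) ||       /* overlong 4-byte form */
                (cp >= 0xD800 && cp <= 0xDFFF) || /* UTF-16 surrogate */
                cp > 0x10FFFF)                    /* beyond Unicode range */
                return 0;
            i += n + 1;
        }
        return 1; /* no violations anywhere: treat the file as UTF-8 */
    }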
Upvotes: 7
Views: 2614
Reputation: 10502
Visual Studio Code uses jschardet, which returns a guess and a confidence level. It's all open source, so you can inspect the code.
https://github.com/microsoft/vscode/issues/101930#issuecomment-655565813
Upvotes: 2
Reputation: 9
We just found a solution to this. Basically, when you don't know the encoding of a file/stream/source, you need to check the entire file and/or look for portions of text to see if you get UTF-8 matches. I see this as similar to what some antivirus products do, checking for portions of known viral sub-strings.
I'd suggest you apply a function similar to what we did, reading the file/stream line by line to determine whether UTF-8 encoding is found or not; a sketch of such a driver follows the reference below.
Please refer to our post below
Ref. - https://stackoverflow.com/questions/17283872/how-to-detect-utf-8-based-encoded-strings
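To make that concrete, here is a hedged sketch of such a line-by-line driver in C; file_looks_like_utf8 is a hypothetical name, and it reuses the is_valid_utf8 checker sketched under the question's edit above. One helpful fact: '\n' (0x0A) can never appear inside a multi-byte UTF-8 sequence, so splitting the input at newlines never splits a sequence.

    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    int is_valid_utf8(const unsigned char *buf, size_t len); /* sketch above */

    /* Hypothetical line-by-line classifier: read the file a line at a time
       and reject it on the first UTF-8 violation. Assumes text without
       embedded NUL bytes (fgets/strlen); lines longer than the buffer may
       be split mid-sequence, so a production version would carry over any
       incomplete trailing bytes to the next read. */
    int file_looks_like_utf8(const char *path)
    {
        char line[4096];
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return 0;
        while (fgets(line, sizeof line, fp)) {
            if (!is_valid_utf8((const unsigned char *)line, strlen(line))) {
                fclose(fp);
                return 0;   /* first violation found: not UTF-8 */
            }
        }
        fclose(fp);
        return 1;           /* every line checked out: call it UTF-8 */
    }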
Upvotes: 1
Reputation: 306
If the encoding is UTF-8, the first byte you see over 0x7F must be the start of a multi-byte UTF-8 sequence, so test it for that. Here is the code we use:
    typedef unsigned char unc;  /* byte type used throughout */

    /* Given a pointer to a byte over 0x7F, return the length of the
       well-formed UTF-8 sequence starting there (2, 3, or 4), or 0 if
       it is not a valid lead byte followed by continuation bytes.
       Note: this does not reject overlong forms or surrogate ranges. */
    unc IsUTF8(unc *cpt)
    {
        if (!cpt)
            return 0;

        if ((*cpt & 0xF8) == 0xF0) {        /* start of 4-byte sequence */
            if (((*(cpt + 1) & 0xC0) == 0x80)
                && ((*(cpt + 2) & 0xC0) == 0x80)
                && ((*(cpt + 3) & 0xC0) == 0x80))
                return 4;
        }
        else if ((*cpt & 0xF0) == 0xE0) {   /* start of 3-byte sequence */
            if (((*(cpt + 1) & 0xC0) == 0x80)
                && ((*(cpt + 2) & 0xC0) == 0x80))
                return 3;
        }
        else if ((*cpt & 0xE0) == 0xC0) {   /* start of 2-byte sequence */
            if ((*(cpt + 1) & 0xC0) == 0x80)
                return 2;
        }
        return 0;
    }
If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte over 0x7F.
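A scan loop of the kind described might look like this (a sketch, not the answerer's actual code; BufferIsUTF8 is a hypothetical name). It assumes the buffer is NUL-terminated, so IsUTF8's lookahead of up to three bytes stops safely: a 0x00 byte fails the continuation-byte test.

    #include <stddef.h>

    /* Hypothetical driver for IsUTF8() above: let 7-bit bytes pass and
       require every byte over 0x7F to start a well-formed sequence.
       Assumes buf is NUL-terminated, so IsUTF8's lookahead cannot run
       past the end: 0x00 fails the (b & 0xC0) == 0x80 test. */
    int BufferIsUTF8(unc *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            if (buf[i] < 0x80) {
                i++;                  /* plain ASCII, always fine */
            } else {
                unc n = IsUTF8(buf + i);
                if (n == 0)
                    return 0;         /* invalid lead or continuation */
                i += n;
            }
        }
        return 1;                     /* no violations: looks like UTF-8 */
    }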
Upvotes: 8