Ira Baxter

Reputation: 95326

Detect UTF-8 encoding (How does MS IDE do it)?

A problem with various character encodings is that the containing file is not always clearly marked. There are inconsistent conventions for marking some of them with "byte-order marks" (BOMs). But in essence, you have to be told what the file encoding is in order to read the file accurately.

We build programming tools that read source files, and this gives us grief. We have means to specify defaults, and sniff for BOMs, etc. And we do pretty well with conventions and defaults. But a place we (and I assume everybody else) gets hung up on are UTF-8 files that are not BOM-marked.

Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine whether it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic (although discussions of heuristics are fine). How can it be "right"? (Consider an ISO8859-x encoded string interpreted this way.)

EDIT: This paper on detecting character encodings/sets is pretty interesting: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

EDIT December 2012: We ended up scanning the entire file to see whether it contains any violations of UTF-8 sequences... and if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)

Upvotes: 7

Views: 2614

Answers (3)

jedmao

Reputation: 10502

Visual Studio Code uses jschardet, which returns a guess and a confidence level. It's all open source, so you can inspect the code.

https://github.com/microsoft/vscode/issues/101930#issuecomment-655565813

Upvotes: 2

Diego Sendra

Reputation: 9

We just found a solution to this. Basically, when you don't know the encoding of a file/stream/source, you need to check the entire file, or at least look at portions of the text, to see whether you get valid UTF-8 matches. I see this as similar to what some antivirus products do when they check for portions of known viral sub-strings.

I'd suggest calling a function similar to ours while reading the file/stream line by line, to determine whether UTF-8 encoding is found or not.

Please refer to our post below:

Ref. - https://stackoverflow.com/questions/17283872/how-to-detect-utf-8-based-encoded-strings

Upvotes: 1

Jeremy Griffith

Reputation: 306

If the encoding is UTF-8, the first byte you see over 0x7F must be the start of a UTF-8 sequence. So test it for that. Here is the code we use for that:

int IsUTF8(const unsigned char *cpt)
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80)
         && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking from the next byte over 0x7F.

Upvotes: 8

Related Questions