Nordlöw

Reputation: 12138

Detecting Multi-Byte Character Encodings

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect when the matcher halted, that is, to detect prefix-match ranges for a given set of possible encodings.

Upvotes: 2

Views: 1501

Answers (3)

Éric Malenfant

Reputation: 14148

ICU does character set detection. Note, however, what the ICU documentation states:

This is, at best, an imprecise operation using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language.

Upvotes: 5

Tobias Langner

Reputation: 10828

In general, there is no way to detect the character encoding, unless the text carries some special mark denoting the encoding. You could heuristically detect an encoding using dictionaries that contain words with characters that are present only in some encodings.

This can of course only be a heuristic and you need to scan the whole text.

Example: "an English text can be written in multiple encodings". This sentence could, for instance, be written using a German codepage; it is indistinguishable from most "western" encodings (including UTF-8) unless you add special characters (like ä) that are not present in ASCII.
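The one reliable "special mark" is a byte order mark (BOM) at the start of the data. A self-contained sketch of BOM sniffing (`detect_bom` is a hypothetical helper, not from any library):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: report the encoding implied by a leading BOM, or
// "unknown" if none is present. Order matters: UTF-32LE must be checked
// before UTF-16LE, because its BOM begins with the UTF-16LE BOM bytes.
std::string detect_bom(const unsigned char *data, std::size_t len) {
    if (len >= 4 && std::memcmp(data, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 4 && std::memcmp(data, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && std::memcmp(data, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    if (len >= 2 && std::memcmp(data, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    return "unknown"; // no BOM: fall back to heuristics like the above
}
```

A BOM is optional in all of these encodings, so its absence proves nothing; this only narrows the search when one happens to be present.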

Upvotes: 1

Mike DeSimone

Reputation: 42825

If the input is only ASCII, there's no way to tell what would have been done had there been any high-bit-set bytes in the stream. You may as well just pick UTF-8 in that case.

As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant it is. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
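The try-UTF-8-then-fall-back strategy can be sketched with a simple structural validator. This is a simplification: it checks lead/continuation byte patterns only and ignores overlong forms and surrogates, and the ISO-8859-1 fallback is an arbitrary choice, since (as noted) the variants can't be told apart:

```cpp
#include <cstddef>
#include <string>

// True if the bytes form structurally well-formed UTF-8 sequences.
bool is_valid_utf8(const unsigned char *s, std::size_t len) {
    for (std::size_t i = 0; i < len; ) {
        unsigned char c = s[i];
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                       // stray continuation or bad lead
        if (i + extra >= len) return false;      // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((s[i + j] & 0xC0) != 0x80) return false; // bad continuation byte
        i += extra + 1;
    }
    return true;
}

// Hypothetical wrapper implementing the fallback described above.
std::string guess_encoding(const unsigned char *s, std::size_t len) {
    return is_valid_utf8(s, len) ? "UTF-8" : "ISO-8859-1";
}
```

Note that random ISO-8859 text rarely happens to be valid UTF-8, which is why this fallback works better in practice than its simplicity suggests.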

Upvotes: 2
