art-solopov

Reputation: 4755

In Ruby, how to reliably detect a file's encoding (including UTF-16 without BOM)?

I need to detect a file type and encoding in Ruby.

I'm currently using libmagic through the magic gem, but it has one problem: it doesn't detect UTF-16 files if they don't have a BOM. Here is an example of such a file:

$ file -i text_without_bom.txt
text_without_bom.txt: application/octet-stream; charset=binary

Is there any other library or method I could use that would detect UTF-16 files properly?

P.S. I also tried rchardet and charlock_holmes, without much luck.

Upvotes: 1

Views: 892

Answers (2)

Jörg W Mittag

Reputation: 369458

It is impossible to detect the encoding of a text file reliably. You have to be told out-of-band what the encoding is.

The reason for this is simple: there are tons of 8-bit encodings, and in those encodings every combination of 8 bits is a valid character. Since every combination of 8 bits is a valid character in every 8-bit encoding, any arbitrary text file, and in fact any arbitrary file at all, is a valid text file in any 8-bit encoding.

For example, in ISO 8859-15, 0xA4 is the Euro sign €. In ISO 8859-1, CP1252, and Unicode, 0xA4 is the international currency sign ¤. So, if you have a file that contains 0xA4, you cannot know whether it is ISO 8859-15, ISO 8859-1, CP1252, one half of a character in UTF-16, one quarter of a character in UTF-32, the middle of a multibyte sequence in UTF-8, or one of many other possibilities.
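You can see this ambiguity directly in Ruby: the same single byte decodes to different characters depending entirely on which encoding you declare for it (a small sketch using Ruby's built-in Encoding support):

```ruby
# One byte, 0xA4, in a binary string. The bytes alone cannot tell us
# the encoding -- only the label we attach changes the interpretation.
byte = "\xA4".b

puts byte.dup.force_encoding("ISO-8859-15").encode("UTF-8")  # => "€"
puts byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")   # => "¤"
puts byte.dup.force_encoding("Windows-1252").encode("UTF-8") # => "¤"
```

`force_encoding` relabels the bytes without converting them, which is exactly the out-of-band decision the answer says has to come from somewhere else.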

Upvotes: 2

Vladimir Bogaevsky

Reputation: 1

You can always cut the BOM off and process the file without it. This describes how it can be done.
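A minimal sketch of that idea: read the file in binary, check for the common BOMs, strip the one found, and tag the remaining bytes with the encoding the BOM implied. The `read_without_bom` helper and its `BOMS` table are hypothetical names, not part of any gem (note this only helps for files that *do* have a BOM, unlike the files in the question):

```ruby
# Map each BOM byte sequence to the encoding it signals.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

# Hypothetical helper: return the file's contents with any leading BOM
# removed, tagged with the encoding the BOM indicated.
def read_without_bom(path)
  data = File.binread(path)
  BOMS.each do |bom, enc|
    if data.start_with?(bom)
      return data[bom.bytesize..].force_encoding(enc)
    end
  end
  data # no BOM found; encoding stays unknown (ASCII-8BIT)
end
```

Ruby also has this built in: `File.read(path, mode: "rb:bom|utf-8")` consumes a matching BOM automatically when one is present.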

Upvotes: -1
