Reputation: 4755
I need to detect a file type and encoding in Ruby.
I'm currently using libmagic through the magic gem, but it has one problem: it doesn't detect UTF-16 files if they don't have a BOM. This is an example of such a file:
$ file -i text_without_bom.txt
text_without_bom.txt: application/octet-stream; charset=binary
Is there any other library or method I could use that would detect UTF-16 files properly?
P.S. I also tried rchardet and charlock_holmes, without much luck.
Upvotes: 1
Views: 892
Reputation: 369458
It is impossible to detect the encoding of a text file reliably. You have to be told out-of-band what the encoding is.
The reason for this is simple: there are tons of 8-bit encodings, and in many of them every one of the 256 possible byte values is a valid character. Since every byte is a valid character in such an encoding, any arbitrary file at all is a valid text file in any of those 8-bit encodings.
For example, in ISO 8859-15 the byte 0xA4 is the Euro sign €. In ISO 8859-1, CP1252, and Unicode, 0xA4 is the international currency sign ¤. So, if you have a file that contains the byte 0xA4, you cannot know whether it is ISO 8859-15, ISO 8859-1, CP1252, one half of a character in UTF-16, one quarter of a character in UTF-32, the middle of a multibyte sequence in UTF-8, or one of many other possibilities.
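You can see this ambiguity directly in Ruby: the very same byte decodes to a different, perfectly valid character under each single-byte encoding, while being invalid on its own as UTF-8. A minimal sketch (the byte value 0xA4 is the one from the example above):

```ruby
# One arbitrary byte as a binary (ASCII-8BIT) string.
bytes = "\xA4".b

# The same byte is a valid character in each of these encodings,
# but it means something different in each.
puts bytes.dup.force_encoding("ISO-8859-15").encode("UTF-8")   # Euro sign
puts bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")    # currency sign
puts bytes.dup.force_encoding("Windows-1252").encode("UTF-8")  # currency sign

# As UTF-8, a lone 0xA4 is a stray continuation byte: invalid.
puts bytes.dup.force_encoding("UTF-8").valid_encoding?         # false
```

No amount of inspection of the bytes alone can tell these cases apart, which is exactly why the encoding has to be communicated out-of-band.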
Upvotes: 2
Reputation: 1
You can always cut the BOM off and process the file without it. The linked article describes how it can be done.
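A minimal sketch of that idea in Ruby, checking the raw bytes against the standard BOM sequences (the helper name strip_bom is just for illustration; UTF-32 BOMs are omitted for brevity, and a file with no BOM is returned unchanged with an unknown encoding):

```ruby
# Detect a leading BOM, strip it, and report the matching encoding name.
# Returns [bytes_without_bom, encoding_name_or_nil].
def strip_bom(data)
  raw = data.b  # work on the raw bytes, independent of the tagged encoding
  case
  when raw.start_with?("\xEF\xBB\xBF".b) then [raw[3..], "UTF-8"]
  when raw.start_with?("\xFF\xFE".b)     then [raw[2..], "UTF-16LE"]
  when raw.start_with?("\xFE\xFF".b)     then [raw[2..], "UTF-16BE"]
  else [raw, nil]  # no BOM found; encoding still unknown
  end
end

# Example: "hi" encoded as UTF-16LE with a leading BOM.
body, enc = strip_bom("\xFF\xFEh\x00i\x00")
puts enc                                   # UTF-16LE
puts body.force_encoding(enc).encode("UTF-8")  # hi
```

Note this only helps for files that actually carry a BOM; for the BOM-less UTF-16 files from the question, the else branch fires and you are back to needing out-of-band information.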
Upvotes: -1