Reputation: 775
What should I use to read text files whose encoding (ASCII or Unicode) I don't know?
Is there some class that auto-detects the encoding?
Upvotes: 3
Views: 3544
Reputation: 2658
One (brute-force) way of doing this:
Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
If you are sure that the incoming encoding is either ANSI or Unicode, you can also check for a byte order mark (BOM). Be aware, though, that this is not foolproof: a file may simply not have one.
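As a rough illustration of the byte order mark check, here is a minimal Python sketch. The function name `sniff_bom` is made up for this example; the BOM constants come from the standard `codecs` module. It only reports what a leading BOM suggests and returns `None` when there is no BOM, which is the common case for ASCII and many UTF-8 files.

```python
import codecs

# Longer BOMs are listed first: the UTF-32LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else None.

    sniff_bom is a hypothetical helper name used only for this sketch.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A caller would typically read the first few bytes in binary mode, call `sniff_bom`, and skip `bom_length` bytes before decoding the rest. A `None` result says nothing about the encoding, so another guess is still needed in that case.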
Upvotes: 0
Reputation: 153919
This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used to make a guess, however: read the first "page" (512 bytes or so), then check, in the following order:
'\0', other, '\0', other -> UTF16BE
other, '\0', other, '\0' -> UTF16LE
'\0', '\0', '\0', other -> UTF32BE
other, '\0', '\0', '\0' -> UTF32LE
But as I said, it's not 100% sure.
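The zero-byte patterns above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `guess_from_nulls` is invented here, "other" is taken to mean any non-zero byte, and only the first four bytes are inspected rather than the whole page. It can only work when the text starts with characters from the ASCII/Latin-1 range, since those are the ones that produce zero bytes in UTF-16 and UTF-32.

```python
def guess_from_nulls(page):
    """Guess a UTF-16/UTF-32 flavour from where zero bytes fall in the first 4 bytes.

    Hypothetical helper for illustration; returns None when no pattern matches.
    """
    if len(page) < 4:
        return None
    # True where the byte is '\0', False where it is "other" (non-zero).
    zeros = tuple(b == 0 for b in page[:4])
    patterns = {
        (True, False, True, False): "utf-16-be",
        (False, True, False, True): "utf-16-le",
        (True, True, True, False): "utf-32-be",
        (False, True, True, True): "utf-32-le",
    }
    return patterns.get(zeros)
```

The four patterns are mutually exclusive when applied to the same four bytes, so a dictionary lookup is enough here. A more careful version might count how often each pattern holds across the whole page instead of trusting a single position.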
(PS: How do I format a table here? The text in point 2 is declared as an HTML table, but it doesn't seem to be showing up as one.)
Upvotes: 1
Reputation: 6778
I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can also be read as ISO-8859-15, because ASCII is a subset of it. Even worse, other files may be valid in two different encodings, with a different meaning in each. So you need to get this information by some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
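A minimal Python sketch of the "assume UTF-8" approach might look like the following. The function name `decode_utf8_first` and its `fallback` parameter are invented for this example. Using the locale's preferred encoding as the fallback is an assumption of the sketch, chosen because on *NIX systems it is derived from settings such as LC_CTYPE.

```python
import locale


def decode_utf8_first(raw, fallback=None):
    """Try strict UTF-8 first; if the bytes are not valid UTF-8, use a fallback.

    Hypothetical helper for illustration. Returns (text, encoding_used).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: fall back to the locale's preferred encoding unless
        # the caller names one explicitly.
        enc = fallback or locale.getpreferredencoding(False)
        return raw.decode(enc, errors="replace"), enc


# The same byte is valid in two encodings but means different things:
# 0xA4 is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15.
as_latin1 = b"\xa4".decode("iso-8859-1")
as_latin9 = b"\xa4".decode("iso-8859-15")
```

Trying UTF-8 first works reasonably well in practice because text in other 8-bit encodings that contains non-ASCII bytes rarely happens to be valid UTF-8. The last two lines show why the fallback is still only a guess: both decodes succeed, and nothing in the bytes says which one was intended.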
Upvotes: 6