paraguma

Reputation: 187

How to determine text encoding

I know that a UTF file can have a BOM that identifies its encoding, but what about other encodings that give no such clue? How can I guess those?

I am a new Java programmer. I have written code that guesses UTF encodings from the BOM, but I have a problem with other encodings. How do I guess them?

Can anybody help me? Thanks in advance.

Upvotes: 5

Views: 638

Answers (3)

Todd Owen

Reputation: 16178

This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).

Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
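When the set of possible encodings is small, one simple approach is to try a strict decode with each candidate and keep the first one that does not fail. The following is only a minimal sketch of that idea: the class and method names (`CandidateEncodingGuesser`, `firstStrictMatch`) are made up for illustration, and the candidate order is an assumption you would tune for your own data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CandidateEncodingGuesser {

    /**
     * Returns the first candidate charset that decodes the bytes without
     * reporting malformed or unmappable input, or empty if none does.
     * (Hypothetical helper, not part of any library.)
     */
    static Optional<Charset> firstStrictMatch(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // This candidate rejected the bytes; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1 ends in the single byte 0xE9,
        // which is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(firstStrictMatch(latin1, candidates));
    }
}
```

Order matters here: ISO-8859-1 accepts every possible byte sequence, so it only makes sense as the last fallback, after stricter encodings such as UTF-8 have been ruled out.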

Upvotes: 4

Hendrik

Reputation: 2031

If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some characteristics of the data can give you hints, though.

For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file containing mostly Latin text has loads of them.
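The 0x00 hint can be turned into a rough check by counting zero bytes in a sample of the file. This is just a sketch: the names `NullByteHeuristic`, `nullByteRatio` and `looksLikeUtf16` are hypothetical, and the 20% threshold is an arbitrary assumption, not an established value.

```java
public class NullByteHeuristic {

    /** Fraction of bytes in the sample that are 0x00 (0.0 for an empty sample). */
    static double nullByteRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /**
     * Guesses that the sample is UTF-16 when many bytes are zero.
     * The 0.2 cutoff is an assumed value chosen only for illustration.
     */
    static boolean looksLikeUtf16(byte[] data) {
        return nullByteRatio(data) > 0.2;
    }
}
```

For mostly ASCII text encoded as UTF-16, about half of the bytes are zero, so the ratio is far above the cutoff. Text in scripts outside the Latin range (CJK, for instance) produces few or no zero bytes in UTF-16, so this check would miss it.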

The most common solution is to let the user select the encoding if you cannot detect it.

Upvotes: 0

Álvaro González

Reputation: 146350

Short answer is: you cannot.

Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it, since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.

That said, most apps that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
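As an illustration of signature checking, here is a minimal sketch that looks only at the well-known Unicode BOM byte sequences at the start of a buffer. The class and method names (`BomSniffer`, `detectBom`) are invented for this example, and real applications usually combine this with other heuristics.

```java
import java.util.Optional;

public class BomSniffer {

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the encoding name implied by a leading BOM, or empty if the
     * buffer does not start with one. The UTF-32 checks come before the
     * UTF-16 ones because FF FE 00 00 also starts with the UTF-16LE BOM.
     */
    static Optional<String> detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }
}
```

An empty result only means that no BOM was found. It says nothing about which encoding the file actually uses, so a fallback (another heuristic, or asking the user) is still needed.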

Upvotes: 0
