Rajat Gupta

Reputation: 26617

UTF8 string to byte[] with each character as single byte

I would like to take input from the user as a UTF-8 string, detect the language of the string, and store the string as a compressed byte[]. If the characters are not all from the same language, the input is invalid. Once I have valid input, I want to store the string as a byte array.

If the user enters a string with non-English characters, each character occupies more than one byte in UTF-8. So I would like to store the language of the string once and then store each character in a single byte. My guess is that this is possible by storing only each character's offset from the first code point of that language: since all characters are from the same language, the offsets may fit in a single byte because the range is small. This is how I intend to compress each character into a single byte.

Is this a correct approach? If so, how can I detect the language of the characters in the string?

Upvotes: 0

Views: 861

Answers (1)

Bobulous

Reputation: 13169

Take a look at the Character.UnicodeBlock class, which provides the static methods of(char) and of(int) to detect the Unicode block of a character. This will tell you whether a character is, for example, from the ARABIC block or from the BASIC_LATIN block.
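As a minimal sketch of that lookup (the class name BlockLookup and the helper blockAt are made up for illustration; only Character.UnicodeBlock.of comes from the JDK):

```java
import java.lang.Character.UnicodeBlock;

class BlockLookup {
    // Hypothetical helper: returns the Unicode block of the code point
    // that starts at the given index of the text.
    static UnicodeBlock blockAt(String text, int index) {
        return UnicodeBlock.of(text.codePointAt(index));
    }

    public static void main(String[] args) {
        // char overload
        System.out.println(UnicodeBlock.of('A'));
        // int overload, here with U+0633 (ARABIC LETTER SEEN)
        System.out.println(UnicodeBlock.of(0x0633));
        // via the helper, using a string that starts with an Arabic letter
        System.out.println(blockAt("\u0633\u0644\u0627\u0645", 0));
    }
}
```

The int overload is the safer one for general input, because a char can only hold one half of a surrogate pair for characters outside the Basic Multilingual Plane.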

However, notice that there are several *LATIN* blocks, and many languages need to use characters from several blocks. So working out what language is being provided to you is going to be very hard work. I can think of no way to automatically detect this.
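To see how quickly a single language spills across blocks, here is a rough sketch that collects the distinct blocks used by a string (BlockSurvey and blocksIn are hypothetical names, and this says nothing about the actual language, only about blocks):

```java
import java.lang.Character.UnicodeBlock;
import java.util.LinkedHashSet;
import java.util.Set;

class BlockSurvey {
    // Hypothetical helper: collects the distinct Unicode blocks
    // used by the code points of the text, in order of first appearance.
    static Set<UnicodeBlock> blocksIn(String text) {
        Set<UnicodeBlock> blocks = new LinkedHashSet<>();
        text.codePoints().forEach(cp -> blocks.add(UnicodeBlock.of(cp)));
        return blocks;
    }

    public static void main(String[] args) {
        // Plain ASCII stays within one block.
        System.out.println(blocksIn("cafe"));
        // The same word with U+00E9 (e with acute accent) already needs a second block.
        System.out.println(blocksIn("caf\u00e9"));
    }
}
```

A "same block" check would therefore reject an ordinary French word, which is why a per-block test is a poor stand-in for a per-language test.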

Also bear in mind that many Unicode blocks are enormous, and there's no way that you'll be able to fit all valid characters from a single language into just one byte. (Take a look at the Unicode 6.1 Character Code Charts to appreciate just how vast Unicode is.) So, honestly, you are not going to be able to compress every character into a single byte.
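If you want to check this for a particular block, a brute-force sketch like the following counts the code points that Java maps to it (BlockSize and size are hypothetical names; the count covers the whole block range, including positions with no assigned character):

```java
import java.lang.Character.UnicodeBlock;

class BlockSize {
    // Hypothetical helper: counts every code point that
    // Character.UnicodeBlock.of maps to the given block.
    static int size(UnicodeBlock block) {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (UnicodeBlock.of(cp) == block) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int byteCapacity = 256; // distinct values one byte can hold
        UnicodeBlock cjk = UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
        int cjkSize = size(cjk);
        System.out.println(cjk + ": " + cjkSize + " code points, fits in a byte: "
                + (cjkSize <= byteCapacity));
    }
}
```

The CJK Unified Ideographs block alone runs from U+4E00 to U+9FFF, so an offset from the start of that block needs at least two bytes.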

UTF-8 is the result of years of internationalization standards, and it's probably the best option for any software which needs to represent multiple languages. Trying to produce something more efficient will probably cost you a huge amount of time, and result in only small gains.
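To get a feel for how much there is to gain, you can measure what UTF-8 already costs per string. This is only a sketch (Utf8Length and utf8Length are hypothetical names, and the sample strings are my own):

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Hypothetical helper: number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "hello";                     // 5 ASCII letters
        String arabic = "\u0633\u0644\u0627\u0645";   // 4 Arabic letters
        String chinese = "\u4f60\u597d";              // 2 CJK ideographs

        System.out.println("english: " + utf8Length(english) + " bytes");
        System.out.println("arabic:  " + utf8Length(arabic) + " bytes");
        System.out.println("chinese: " + utf8Length(chinese) + " bytes");
    }
}
```

ASCII text is already one byte per character in UTF-8, and Arabic letters take two, so even a perfect one-byte scheme would at best halve the size for such scripts.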

Upvotes: 1
