somshivam

Reputation: 789

In Java, how to detect if a string is Unicode-escaped

I have a properties file that may or may not contain Unicode-escaped characters in the values of its keys; see the sample below. My job is to ensure that any value containing a non-ASCII character is Unicode-escaped. In the sample below, the first entry is OK, and every entry like the second should be converted to the form of the first.

##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling

Essentially, my question is: how can I differentiate in Java between cari\u00F1o and cariño, given that Java treats them as identical once the file is loaded?

Upvotes: 0

Views: 1682

Answers (4)

Igor Rodriguez

Reputation: 1246

The library ICU4J seems to be what you're looking for. See the Normalization page.

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77454

Your problem is that the Java Properties class decodes properties files assuming ISO-8859-1 encoding and parses Unicode escapes as it loads them.

So from a Properties point of view, these two strings are indeed the same.
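To illustrate the point, here is a minimal sketch (the class name PropertiesEscapeDemo and the key name "key" are made up for the example). It loads the escaped and the unescaped form through Properties.load(Reader) and compares the results; by the time you hold the value, the difference is gone.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesEscapeDemo {
    // Loads a one-line properties snippet and returns the value stored under "key".
    static String loadValue(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("key");
    }

    public static void main(String[] args) {
        // File text contains a backslash, the letter u and four hex digits.
        String escaped = loadValue("key=cari\\u00F1o");
        // File text contains the actual n-with-tilde character.
        String raw = loadValue("key=cari\u00F1o");
        System.out.println(escaped.equals(raw)); // prints true
    }
}
```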

I believe if you need to differentiate these two, you will need to write your own parser.

It's actually a feature that you do not need to care about this by default. What strikes me as most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.

Upvotes: 0

Stephen Ostermiller

Reputation: 25524

Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages, such as those from Eastern Europe, Russia, or China, without escaping them.

As such there are only a few non-ascii characters that can appear in a properties file without being escaped.

To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, either with a FileInputStream or through Class.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the line breaks \r and \n, which is the ASCII range of characters you would expect in a properties file.

I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties files. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
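A rough sketch of that byte scan might look like the following. The class name AsciiScan and the method firstNonAsciiOffset are invented for the example, and it takes the file path from the command line; adjust the allowed range (tabs, for instance) to whatever your files legitimately contain.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AsciiScan {
    // Returns the zero-based offset of the first byte outside 0x20-0x7E
    // (other than CR and LF), or -1 if every byte is in the expected range.
    static long firstNonAsciiOffset(InputStream in) {
        try {
            long offset = 0;
            int b;
            while ((b = in.read()) != -1) {
                boolean printable = b >= 0x20 && b <= 0x7E;
                boolean lineBreak = b == '\r' || b == '\n';
                if (!printable && !lineBreak) {
                    return offset;
                }
                offset++;
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java AsciiScan <file.properties>");
            return;
        }
        // Read the raw bytes; do not go through the Properties class here.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            long offset = firstNonAsciiOffset(in);
            System.out.println(offset < 0 ? "all bytes OK" : "unexpected byte at offset " + offset);
        }
    }
}
```

Reporting the offset rather than a plain yes/no makes it easier to point a translator at the offending character.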

Upvotes: 2

Ian Roberts

Reputation: 122364

You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
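If you would rather do the same thing from Java code, the core of the conversion is small. This is only a sketch of the idea, not the tool's actual implementation; the class name EscapeNonAscii and the method escape are hypothetical, and it assumes the text was already decoded with the file's real character set. Existing escapes consist purely of ASCII characters, so they pass through unchanged.

```java
public class EscapeNonAscii {
    // Replaces every char above 0x7F with a backslash-u escape of four hex digits.
    // Characters outside the BMP arrive as two surrogate chars and become two escapes.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument below is "cari", n-with-tilde, "o".
        System.out.println(escape("nonescaped=cari\u00F1o"));
    }
}
```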

Upvotes: 1
