Ankita Rajput

Reputation: 1

Convert non-English characters to English letters (ones that look the same as English letters) in Java?

Suppose a name such as "ОХ699" is typed using a different keyboard layout. As a result, "ОХ" is flagged as non-English characters, even though the letters look like the English "OX". Is there any way to convert characters like these directly to English characters?

I tried the following code to convert "ОХ" to the English letters "OX":

import java.text.Normalizer;

String subjectString = "ОХ699"; // letters typed on a non-English layout
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", ""); // drop non-ASCII

But it does not convert the letters to English. Actual output: "699". Expected output: "OX699".

Upvotes: -1

Views: 654

Answers (1)

Mateusz

Reputation: 758

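This is not possible using the standard library alone. You have to implement your own translations, because the right mapping depends on the use case: one person wants to translate Р (R in Cyrillic) to p because it looks like P, and another wants r because it sounds like R. There is also the problem of Chinese characters and emoji.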
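As a rough illustration of what "your own translations" could look like, here is a minimal sketch that maps a handful of Cyrillic capitals to the Latin capitals they resemble. The class name HomoglyphFolder and the choice of letters are hypothetical and deliberately incomplete (no lowercase, no Greek, and so on); it picks the "looks like" mapping, so Р becomes P rather than R.

```java
// Hypothetical helper: replaces a few Cyrillic capitals with look-alike Latin capitals.
// The table is an assumption for illustration, not a complete confusables list.
class HomoglyphFolder {

    // Returns the Latin look-alike for a known Cyrillic capital, or c unchanged.
    static char foldChar(char c) {
        switch (c) {
            case '\u0410': return 'A'; // А
            case '\u0412': return 'B'; // В
            case '\u0415': return 'E'; // Е
            case '\u041A': return 'K'; // К
            case '\u041C': return 'M'; // М
            case '\u041D': return 'H'; // Н
            case '\u041E': return 'O'; // О
            case '\u0420': return 'P'; // Р
            case '\u0421': return 'C'; // С
            case '\u0422': return 'T'; // Т
            case '\u0425': return 'X'; // Х
            default:       return c;   // leave everything else untouched
        }
    }

    // Applies foldChar to every char of the input.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            out.append(foldChar(input.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is Cyrillic О, Cyrillic Х, then ASCII digits.
        System.out.println(fold("\u041E\u0425699")); // prints OX699
    }
}
```

Characters that are not in the table pass through unchanged, so you can still strip or reject any remaining non-ASCII characters afterwards if that is what your application needs.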

There is a Linux program, uni2ascii, that does exactly what you want. You can see how it is implemented at https://salsa.debian.org/debian/uni2ascii/-/blob/master/uni2ascii.c (see the extremely big switch statements). There is also a Python clone of this app, but a very, very simplified one: https://github.com/ajanin/uni2ascii/blob/master/uni2ascii/__init__.py#L65 . You can copy that switch and implement the translation in your app.

Or install uni2ascii on the server and just call it (or call it using JNI).

Either way, the common practice is simply to ignore and skip non-ASCII characters.

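EDIT: I found an implementation in the Lucene engine, ASCIIFoldingFilter. The link below is to the C# port (Lucene.NET); the Java version of Lucene has a class with the same name: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs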

Upvotes: 1
