Friet Stoofvlees

Reputation: 1

How to translate Unicode characters to an ISO-8859-15/Latin9 equivalent?

I have a UTF-8 JSON document which contains escaped Unicode characters. For example:

{
    "description": "This is an ellipsis: \u2026"
}

The JSON is parsed with Jackson. At a later stage, the strings are converted into bytes for an ISO-8859-15/Latin9 platform:

final byte[] d = description.getBytes(Charset.forName("ISO-8859-15"));

Obviously, the ellipsis character (…) is not in the ISO-8859-15/Latin9 character set (see https://www.charset.org/charsets/iso-8859-15).
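
To illustrate the problem, here is a minimal JDK-only sketch. The class name Latin9Check and the helper isEncodable are made up for this example. It shows that getBytes does not fail on an unmappable character but silently substitutes the charset's replacement byte (a question mark for this charset), and that CharsetEncoder.canEncode can be used to detect such characters up front:

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class Latin9Check {
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    /** Returns true if every character of the text can be encoded in ISO-8859-15. */
    static boolean isEncodable(String text) {
        // A fresh encoder per call keeps the helper free of shared state.
        return LATIN9.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String description = "This is an ellipsis: \u2026";
        byte[] d = description.getBytes(LATIN9);

        // The ellipsis is unmappable, so the last byte is the replacement byte '?'.
        System.out.println(d[d.length - 1] == (byte) '?');

        // The ellipsis is not in ISO-8859-15, the euro sign is.
        System.out.println(isEncodable(description));
        System.out.println(isEncodable("\u20ac"));
    }
}
```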

I am looking for a way to convert unsupported Unicode characters to a sensible character, or sequence of characters, that ISO-8859-15/Latin9 does support. Here, I would expect three dots.

Examples of other characters which are present in the input, with the expected counterpart for each:

\u2013 -> – -> -
\u2018 -> ‘ -> '
\u2019 -> ’ -> '
\u201c -> “ -> "
\u201d -> ” -> "
\u2022 -> • -> .

Ideally, this is done without my having to enumerate all possible inputs and outputs myself, as I don't want to maintain a rather extensive mapping table.

Is there a JDK class or external library out there which can do the conversion?

Upvotes: 0

Views: 174

Answers (1)

Guyndalf

Reputation: 19

This can be done using the Transliteration API of Unicode's ICU project.

For example, in a Java or Kotlin project built with Gradle, add the ICU4J dependency:

implementation("com.ibm.icu:icu4j:75.1")

Then, in code (Kotlin shown here), something like this:

import com.ibm.icu.text.Transliterator

val transliterator = Transliterator.getInstance("Latin-ASCII")
val text = transliterator.transliterate(yourTextString)

That should be it. The mappings for the Latin-ASCII transliterator can be found at https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/Latin_ASCII.txt if you want to check whether they are sensible enough for your purposes.
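
One thing to keep in mind: Latin-ASCII targets ASCII, so it will presumably also rewrite characters that ISO-8859-15 can represent as they are (accented letters, for instance). If that is not wanted, a possible approach is to apply the fallback only to code points the target charset cannot encode. Below is a rough Java sketch of that idea using only the JDK. The class Latin9Fallback and its encode method are hypothetical names, and the fallback is passed in as a plain function so the sketch does not depend on ICU4J; the lambda in main is just a stand-in.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.function.UnaryOperator;

// Hypothetical helper class, for illustration only.
final class Latin9Fallback {
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    /**
     * Encodes text as ISO-8859-15, passing only the code points that the
     * charset cannot represent through the supplied fallback function.
     */
    static byte[] encode(String text, UnaryOperator<String> fallback) {
        CharsetEncoder encoder = LATIN9.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            String s = new String(Character.toChars(codePoint));
            // Keep encodable characters untouched, transliterate the rest.
            out.append(encoder.canEncode(s) ? s : fallback.apply(s));
            i += Character.charCount(codePoint);
        }
        // Anything the fallback left unencodable still becomes '?' here.
        return out.toString().getBytes(LATIN9);
    }

    public static void main(String[] args) {
        // Stand-in fallback; a real transliterator would be plugged in here.
        UnaryOperator<String> stub = s -> s.equals("\u2026") ? "..." : "?";
        byte[] bytes = encode("caf\u00e9\u2026", stub);
        System.out.println(new String(bytes, LATIN9));
    }
}
```

With ICU4J on the classpath, it should be possible to pass Transliterator.getInstance("Latin-ASCII")::transliterate as the fallback, though I have not verified that combination. Since each code point is transliterated in isolation, any mapping that depends on neighbouring characters would be lost.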

Upvotes: 1
