Reputation: 1
I have a UTF-8-encoded JSON document which contains escaped Unicode characters. For example:
{
"description": "This is an ellipsis: \u2026"
}
The JSON is parsed with Jackson. At a later stage, the strings are converted into bytes for an ISO-8859-15/Latin9 platform:
final byte[] d = description.getBytes(Charset.forName("ISO-8859-15"));
Obviously, the ellipsis character (…) is not in the ISO-8859-15/Latin9 character set (see https://www.charset.org/charsets/iso-8859-15).
I am looking for a way to convert non-supported Unicode characters to a sensible ISO-8859-15/Latin9-supported character or set of characters. Here, I would expect three dots.
Examples of other characters which are present in the input and an expected counterpart:
\u2013 -> – -> -
\u2018 -> ‘ -> '
\u2019 -> ’ -> '
\u201c -> “ -> "
\u201d -> ” -> "
\u2022 -> • -> .
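For context, here is a minimal JDK-only sketch of the lossy behavior described above: `String.getBytes(Charset)` silently replaces any character that cannot be encoded in ISO-8859-15 with the charset's replacement byte, `?` (0x3F).

```java
import java.nio.charset.Charset;

public class Latin9Demo {
    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");
        String description = "This is an ellipsis: \u2026";

        // Unmappable characters are silently replaced with '?' (0x3F)
        byte[] d = description.getBytes(latin9);
        System.out.println(new String(d, latin9)); // This is an ellipsis: ?
    }
}
```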
Ideally, this would be done without me having to enumerate all possible inputs and outcomes, as I don't want to maintain a rather extensive mapping table.
Is there a JDK class or external library out there which can do the conversion?
Upvotes: 0
Views: 174
Reputation: 19
This can be done using the Transliteration API of Unicode's ICU project.
E.g. in a Java or Kotlin project using Gradle, add the dependency:
implementation("com.ibm.icu:icu4j:75.1")
then, in code, do something like this:
val transliterator = Transliterator.getInstance("Latin-ASCII")
val text = transliterator.transliterate(yourTextString)
That should be it. The mappings for the Latin-ASCII transliterator can be found at https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/Latin_ASCII.txt if you want to verify whether they are sensible enough for your purposes.
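Since the question uses Java, the same call in Java looks like this (a minimal sketch; it assumes the icu4j dependency above is on the classpath). Latin-ASCII folds typographic punctuation into ASCII, so the result then encodes cleanly to ISO-8859-15:

```java
import com.ibm.icu.text.Transliterator;
import java.nio.charset.Charset;

public class TransliterateDemo {
    public static void main(String[] args) {
        // Latin-ASCII folds typographic punctuation (…, –, ‘ ’, “ ”)
        // into plain ASCII equivalents
        Transliterator t = Transliterator.getInstance("Latin-ASCII");

        String ascii = t.transliterate("This is an ellipsis: \u2026");
        System.out.println(ascii); // This is an ellipsis: ...

        // Now every character survives the Latin9 round-trip
        byte[] d = ascii.getBytes(Charset.forName("ISO-8859-15"));
    }
}
```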
Upvotes: 1