Hasan Tuncay

Reputation: 1118

UTF-8 -- ISO 8859-1 mapping tool

When I convert a UTF-8 string containing characters that do not exist in ISO 8859-1 to ISO 8859-1, I get question marks here and there. Of course, what else could the encoder do?

Is there a Java tool that can map a string like "İKEA" to "IKEA", avoiding the question marks and making the best of the conversion?

Upvotes: 3

Views: 1954

Answers (1)

McDowell

Reputation: 108979

For the specific example, you can:

  • decompose the letters and diacritics using Unicode compatibility decomposition (NFKD normalization)
  • instruct the encoder to drop unsupported characters (the diacritics)

Example:

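import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// create an encoder that silently drops characters ISO 8859-1 cannot represent
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// write data: NFKD splits U+0130 into 'I' plus a combining dot above (U+0307)
String ikea = "\u0130KEA";
String decomposed = Normalizer.normalize(ikea, Form.NFKD);
CharBuffer cbuf = CharBuffer.wrap(decomposed);
ByteBuffer bbuf = encoder.encode(cbuf);
// write only the encoded bytes; the backing array may be larger than the content
out.write(bbuf.array(), bbuf.arrayOffset() + bbuf.position(), bbuf.remaining());
// verify
String decoded = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
System.out.println(decoded);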

You're still transcoding from a character set that defines 109,384 values (Unicode 6) to one that supports 256, so there will always be limitations.
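One such limitation is that decomposing the whole string also strips accents that ISO 8859-1 can represent (for example "é"), and letters with no decomposition (for example "Ł") are dropped entirely. The following is a minimal sketch of a variation, not the approach shown in the example: it keeps every code point the encoder already supports and decomposes only the rest. The class name Latin1Fallback and the method toLatin1BestEffort are hypothetical names chosen for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class Latin1Fallback {

    // Hypothetical helper: returns a string containing only characters
    // that ISO 8859-1 can encode.
    static String toLatin1BestEffort(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder result = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            if (encoder.canEncode(original)) {
                // Already representable: keep it untouched (e.g. U+00E9).
                result.append(original);
                return;
            }
            // Not representable: decompose and keep whatever parts survive.
            String decomposed = Normalizer.normalize(original, Normalizer.Form.NFKD);
            for (int i = 0; i < decomposed.length(); i++) {
                char c = decomposed.charAt(i);
                if (encoder.canEncode(c)) {
                    result.append(c);
                }
            }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        // U+0130 decomposes to 'I' + U+0307; the combining mark is dropped.
        System.out.println(toLatin1BestEffort("\u0130KEA"));
        // U+00E9 and U+00F3 are kept, U+0141 has no decomposition and is
        // dropped, U+017A loses its accent.
        System.out.println(toLatin1BestEffort("caf\u00E9 \u0141\u00F3d\u017A"));
    }
}
```

The result is a String, so it can be passed to getBytes(StandardCharsets.ISO_8859_1) without producing question marks. Characters such as "Ł" still vanish, which is the kind of case where a transliteration library is the better fit.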

Also consider a more sophisticated transformation API like ICU for features like transliteration.

Upvotes: 1
