Vendel Serke

Reputation: 135

Java library for converting characters between different encodings

I'm facing the following situation:

We poll some CSV data from an external source. The source's response headers don't specify the charset, and the data contains some German characters which show up as a question mark inside a rhombus (I know that means the character is not defined in UTF-8).

We want to do some work with this data and then forward it, but to fix this issue we also want to re-encode the erroneous characters into a correct format so that they display properly.

I have already read some answers here, and most of them suggested using the "string.getBytes("encoding")" method and then creating a new string from the result with some other encoding.

But from what I understand, I need something different: this method just decodes the characters and reinterprets the same bytes in another encoding. Some characters are represented with different byte lengths in UTF-8 than in, for example, ISO-8859-1 (which I believe is the real encoding of the data we are polling), and that causes strange characters to appear in the result string. So it's not really what we want to achieve.

I would need something which can

  1. Get the character from a byte representation in a source encoding
  2. Get the character from a byte representation in the target encoding
  3. Iterate over the decoded byte array and replace each character's byte representation with its representation in the target encoding

After this it would be safe to create a new string from the byte array with the target encoding. Does anyone know a good library which can do that? I don't want to implement it myself if it already exists.

Upvotes: 0

Views: 204

Answers (1)

Joop Eggen

Reputation: 109532

You have bytes, binary data, that represent text in some character set. To interpret them you need charset detection. Once you know the Charset, you can decode the bytes into a Java String (Unicode) and then save it as bytes in any Charset you want.
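The decode-then-encode step needs nothing beyond the JDK. Here is a minimal sketch, assuming the source bytes really are ISO-8859-1 as the question suspects and that UTF-8 is the target; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    /**
     * Decodes the bytes with the source charset into a String (Unicode),
     * then encodes that String with the target charset.
     */
    static byte[] transcode(byte[] input, Charset source, Charset target) {
        String text = new String(input, source);
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        // "Grüße" as ISO-8859-1 bytes: one byte per character.
        byte[] latin1 = {(byte) 0x47, (byte) 0x72, (byte) 0xFC, (byte) 0xDF, (byte) 0x65};

        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // In UTF-8, ü and ß take two bytes each, so 5 input bytes become 7.
        System.out.println(latin1.length + " -> " + utf8.length);
    }
}
```

The byte lengths differ between the two encodings, but that is handled by the encoder; no manual per-character replacement is needed.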

If the target Charset cannot represent a Unicode symbol (code point), you can even control how that case is handled. See CharsetDecoder/CharsetEncoder.
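As a rough illustration of that control, the sketch below encodes a String with a CharsetEncoder in two ways: one that fails on unmappable characters and one that substitutes a replacement byte. The class and method names are hypothetical; the CodingErrorAction settings are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class EncodingPolicy {
    /** Throws CharacterCodingException if the target cannot represent a character. */
    static byte[] encodeStrict(String text, Charset target) throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    /** Substitutes '?' for every character the target cannot represent. */
    static byte[] encodeLenient(String text, Charset target) throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] {'?'});
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the readable part of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }
}
```

For example, encoding a text containing the euro sign to ISO-8859-1 throws in the strict variant and yields a '?' byte in the lenient one.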

Some libraries exist for Charset detection. I wrote my own for a partial set of charsets and languages. Detection works best in combination with language detection, for instance for Czech.
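If the only realistic candidates are UTF-8 and ISO-8859-1, as the question suggests, a crude JDK-only heuristic can stand in for a detection library: try a strict UTF-8 decode and fall back to ISO-8859-1 when the bytes are not valid UTF-8. This is only a sketch under that two-candidate assumption, not a general detector; the class name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8OrLatin1 {
    /**
     * Decodes as UTF-8 if the bytes are well-formed UTF-8,
     * otherwise treats them as ISO-8859-1 (which accepts any byte).
     */
    static String decode(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy single-byte encoding.
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }
}
```

The heuristic works because non-ASCII ISO-8859-1 text rarely happens to form valid UTF-8 sequences, but it cannot tell ISO-8859-1 apart from other single-byte charsets such as windows-1252.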

See What is the most accurate encoding detector?

Upvotes: 0
