Reputation: 3640
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to å“ˆç“¦é‚£ (ISO-8859-1).
How do I convert it back to 哈瓦那, considering that all strings in Java are UTF-16? Service B has to compare it as 哈瓦那, not å“ˆç“¦é‚£.
Thanks.
Upvotes: 1
Views: 4792
Reputation: 718678
I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to å“ˆç“¦é‚£ (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
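The effect is easy to reproduce. The following is a minimal sketch, assuming the text really was UTF-8 on the wire and that some component decoded it as ISO-8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class WrongDecodingDemo {
    // Decodes UTF-8 bytes with the wrong charset, as the faulty component presumably does.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The three Chinese characters from the question, written as Unicode escapes
        // so that the encoding of this source file cannot interfere.
        String original = "\u54C8\u74E6\u90A3";
        String garbled = misdecode(original);
        // Each character takes 3 bytes in UTF-8, and each byte becomes one char.
        System.out.println(original.length() + " chars became " + garbled.length());
    }
}
```

No conversion step is involved: the bytes are unchanged, and only their interpretation is wrong.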
Can you unpick this? It depends on where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8 and decode it correctly. On the other hand, if the first mistaken decoding happens in Service A, then you could be out of luck. A mistaken decoding can lose data, because unrecognized codes are replaced with some other character. If that happens, the original data is gone forever.
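To illustrate both outcomes, here is a hedged sketch with hypothetical names. It assumes the wrong charset was exactly ISO-8859-1, in which case the raw bytes can be recovered, and contrasts that with US-ASCII, where they cannot.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RepairDemo {
    // Reverses an ISO-8859-1 mis-decoding: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u54C8\u74E6\u90A3";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Lossless case: ISO-8859-1 maps every byte to exactly one char.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(original)); // true

        // Lossy case: US-ASCII decoding replaces every byte >= 0x80 with U+FFFD,
        // so the original bytes cannot be reconstructed afterwards.
        String lost = new String(utf8, StandardCharsets.US_ASCII);
        byte[] back = lost.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(back, utf8)); // false
    }
}
```

This is a workaround, not a fix. If the faulty component used a different single-byte charset, such as windows-1252, that charset would have to be substituted in `getBytes`, and some bytes may have no mapping there.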
In either case, the best way to deal with this is to figure out which component is getting the character encoding wrong, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.
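As one possible illustration of the XML side, the sketch below hands the parser raw bytes rather than an already-decoded String, so that the parser picks the charset from the XML declaration. The `<city>` element and the class name are made up for the example; the services' real payload is not shown in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {
    // Parses XML from raw bytes; the parser reads the encoding from the XML declaration.
    static String parseText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical payload with an explicit encoding declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<city>\u54C8\u74E6\u90A3</city>";
        byte[] onTheWire = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseText(onTheWire).equals("\u54C8\u74E6\u90A3")); // true
    }
}
```

The point of the design is that no code between the socket and the parser turns the bytes into a String with a guessed charset.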
Upvotes: 2
Reputation: 5210
Use writers and readers to encode/decode your input/output streams:
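import java.io.*;
import java.nio.charset.StandardCharsets;
String yourText = "...";
OutputStream yourOutputStream = ...; // a Writer wraps an OutputStream, not an InputStream
try (Writer out = new OutputStreamWriter(yourOutputStream, StandardCharsets.UTF_8)) {
    out.write(yourText); // chars are encoded as UTF-8 bytes on the way out
}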
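The same applies to reading: wrap the InputStream in an InputStreamReader and pass the charset the bytes were actually written in.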
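For the reading direction, a minimal sketch might look like the following. It assumes the incoming bytes are UTF-8; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Decodes the whole stream as UTF-8 and returns it as a String.
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u54C8\u74E6\u90A3".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(utf8));
        System.out.println(text.length()); // 3
    }
}
```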
Upvotes: 0
Reputation: 691625
When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
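A small sketch of that claim, using an in-memory byte array in place of a file and a sample word that both encodings can represent (the names are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameTextDemo {
    // Decodes bytes with the charset they were actually written in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with an acute accent on the e
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        // The byte sequences differ: 5 bytes in UTF-8, 4 bytes in ISO-8859-1.
        System.out.println(asUtf8.length + " vs " + asLatin1.length);

        // Decoded with the matching charset, both yield the same chars in memory.
        String fromUtf8 = decode(asUtf8, StandardCharsets.UTF_8);
        String fromLatin1 = decode(asLatin1, StandardCharsets.ISO_8859_1);
        System.out.println(fromUtf8.equals(fromLatin1)); // true
    }
}
```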
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to encode every Unicode character, but ISO-8859-1 can encode only a small subset of them (the first 256 code points, which cover mainly Western European languages). The characters you posted look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.
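This can be checked directly. The sketch below, with hypothetical names, asks the ISO-8859-1 encoder whether it can represent a string, and shows what happens when the text is encoded anyway.

```java
import java.nio.charset.StandardCharsets;

class Latin1LimitDemo {
    // Returns true if every character of s can be encoded in ISO-8859-1.
    static boolean fitsInLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String western = "caf\u00E9";
        String chinese = "\u54C8\u74E6\u90A3";
        System.out.println(fitsInLatin1(western)); // true
        System.out.println(fitsInLatin1(chinese)); // false

        // String.getBytes silently substitutes '?' for each unencodable character.
        byte[] lossy = chinese.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ???
    }
}
```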
Upvotes: 5