R.Almoued

Reputation: 499

Java encoding - corrupted French characters

I have a system that receives French text from a third party, but I am having a hard time getting it into a readable form.

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

I tried every combination I could think of to convert the string using the UTF-8 and ISO-8859-1 encodings:

String frenchReceipt = "RETIR�E"; // The original Text should be "RETIRÉE"

byte[] b1 = new String(frenchReceipt.getBytes()).getBytes("UTF-8"); 
System.out.println(new String(b1));  // RETIR�E

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1"); 
System.out.println(new String(b2));  // RETIR�E

byte[] b3 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b3));  // RETIR?E 

byte[] b4 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes(); 
System.out.println(new String(b4));  //RETIR?E

byte[] b5 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("UTF-8"); 
System.out.println(new String(b5));  //RETIR�E

byte[] b6 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("ISO-8859-1"); 
System.out.println(new String(b6));  //RETIR?E

byte[] b7 = new String(frenchReceipt.getBytes(), "UTF-8").getBytes("UTF-8"); 
System.out.println(new String(b7));  //RETIR�E

byte[] b8 = new String(frenchReceipt.getBytes(), "ISO-8859-1").getBytes("ISO-8859-1"); 
System.out.println(new String(b8));  //RETIR�E

As you can see, nothing fixes the problem.

Please advise.

Update: The third-party partner confirmed that the data is sent to my application in ISO-8859-1 encoding.

Upvotes: 1

Views: 1111

Answers (1)

Oleks

Reputation: 1051

� is just the Unicode replacement character (U+FFFD, encoded as EF|BF|BD in UTF-8). It is used to indicate a problem when a system cannot decode or render the correct symbol. The original byte is already lost at that point, so you have no chance of converting � into É.
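To make this concrete, here is a small sketch that prints the bytes behind the corrupted string. The class name ReplacementCharDemo and the hex helper are made up for illustration; the literal uses the escape \uFFFD so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: formats bytes as upper-case hex separated by '|'.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same content as the string in the question: U+FFFD where the accented letter should be.
        String corrupted = "RETIR\uFFFDE";

        // In UTF-8 the replacement character becomes the three bytes EF|BF|BD.
        System.out.println(hex(corrupted.getBytes(StandardCharsets.UTF_8)));

        // ISO-8859-1 cannot represent U+FFFD, so the encoder substitutes '?' (3F).
        System.out.println(hex(corrupted.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Neither byte sequence contains C9, the ISO-8859-1 byte for É, which is probably also why some of the attempts in the question print RETIR?E.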

frenchReceipt doesn't contain any byte sequence which could be converted into É because of the declaration:

String frenchReceipt = "RETIR�E";

Your code snippet below should work fine, but only if you feed it the correct byte source.

byte[] b2 = new String(frenchReceipt.getBytes()).getBytes("ISO-8859-1");
System.out.println(new String(b2));

So if you read "RETIRÉE" as raw bytes from the data source and get 52|45|54|49|52|C9|45 (as expected for ISO-8859-1), then you'll get the proper result. If the data source already contains the byte sequence EF|BF|BD, the only option you have is search and replace, but in that case there is no way to tell apart, e.g., ä and É.
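As a minimal sketch of the point above, the following decodes the byte sequence 52|45|54|49|52|C9|45 twice: once with the charset the sender actually used, and once with UTF-8. The class and method names are illustrative only, and the byte array stands in for whatever the real data source delivers.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes raw bytes with the charset the sender actually used.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes as UTF-8, which is the wrong charset for this data.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "RETIRÉE" as ISO-8859-1 bytes; C9 is É.
        byte[] raw = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45};

        String good = decodeLatin1(raw);
        String bad = decodeUtf8(raw);

        // Compare against escaped literals so the check is independent of console encoding.
        System.out.println(good.equals("RETIR\u00C9E")); // true
        System.out.println(bad.equals("RETIR\uFFFDE"));  // true: C9 alone is malformed UTF-8
    }
}
```

Once the wrong decode has happened, only the resulting string is left, and the C9 byte cannot be recovered from it. The charset has to be chosen at the point where bytes first become characters.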

Update: Since the data is delivered over TCP, wrapping the connection's input stream like this

new BufferedReader(new InputStreamReader(connection.getInputStream(),"ISO-8859-1"))

solved the issue.
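For reference, a self-contained sketch of that fix. A ByteArrayInputStream stands in for connection.getInputStream(), and the class name Latin1StreamDemo and the method readFirstLine are assumptions made for the example, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Latin1StreamDemo {
    // Reads one line from the stream, decoding its bytes as ISO-8859-1.
    static String readFirstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the TCP payload: "RETIRÉE" in ISO-8859-1, followed by a newline.
        byte[] wire = {0x52, 0x45, 0x54, 0x49, 0x52, (byte) 0xC9, 0x45, 0x0A};

        String line = readFirstLine(new ByteArrayInputStream(wire));
        System.out.println(line.equals("RETIR\u00C9E")); // true
    }
}
```

Passing the charset explicitly to InputStreamReader avoids relying on the platform default, which can differ between machines.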

Upvotes: 2
