Michel Hua

Reputation: 1777

Scala convert string between two charsets

I have a malformed UTF-8 string that should read "Michèle Huà" but is output as "Michèle HuÃ".

According to this table, it is a mix-up between Windows-1252 and UTF-8: http://www.i18nqa.com/debug/utf8-debug.html

How do I do the conversion?

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "ISO-8859-1").mkString
res25: String = Michèle HuÃ

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "UTF-8").mkString
res26: String = Michèle HuÃ

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "Windows-1252").mkString
res27: String = Michèle HuÃ

Thank you

Upvotes: 1

Views: 16416

Answers (1)

Rex Kerr

Reputation: 167891

You don't actually have the complete string there, due to an unfortunate issue with one character printing blank. "Michèle Huà" encoded as UTF-8 but read as Windows-1252 is actually "MichÃ¨le HuÃ" followed by one more character, 0xA0 (a non-breaking space, which typically pastes as 0x20, an ordinary space).
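
To see where the trailing character comes from, here is a minimal sketch of the misreading itself. The names `MojibakeDemo`, `original`, `utf8Bytes`, `garbled` and `hex` are illustrative and not from the question, and the string is written with \u escapes so the result does not depend on the source file's encoding.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object MojibakeDemo {
  // "Michèle Huà", written with escapes (\u00E8 = è, \u00E0 = à).
  val original: String = "Mich\u00E8le Hu\u00E0"

  // Encode as UTF-8: è becomes the bytes C3 A8, à becomes C3 A0.
  val utf8Bytes: Array[Byte] = original.getBytes(StandardCharsets.UTF_8)

  // Misread those bytes as Windows-1252:
  // C3 -> Ã, A8 -> ¨, A0 -> non-breaking space (U+00A0).
  val garbled: String = new String(utf8Bytes, Charset.forName("windows-1252"))

  // Hex dump of the UTF-8 bytes, for inspection.
  val hex: String = utf8Bytes.map(b => f"${b & 0xff}%02X").mkString(" ")
}
```

Here `MojibakeDemo.garbled` is two characters longer than `original`, and its last character is U+00A0 rather than an ordinary space.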

If you can include that character, you can convert successfully.

scala> val fixed = new String("MichÃ¨le HuÃ\u00A0".getBytes("Windows-1252"), "UTF-8")
fixed: String = Michèle Huà
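
If this has to be done for more than one string, the same round trip can be wrapped in a small helper. The following is only a sketch: `MojibakeRepair` and `repair` are hypothetical names, and it assumes the damage is exactly "UTF-8 bytes read as Windows-1252". It uses a strict encoder and decoder, so input that does not fit that assumption yields `None` instead of silently producing replacement characters.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object MojibakeRepair {
  private val cp1252: Charset = Charset.forName("windows-1252")

  /** Hypothetical helper: undo "UTF-8 bytes read as Windows-1252".
    * Returns None if the input cannot be the result of such a misreading.
    */
  def repair(garbled: String): Option[String] =
    try {
      // Strict encoder: fails on characters Windows-1252 cannot represent.
      val bytes = cp1252.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .encode(CharBuffer.wrap(garbled))

      // Strict decoder: fails if the recovered bytes are not valid UTF-8.
      val decoded = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(bytes)

      Some(decoded.toString)
    } catch {
      case _: CharacterCodingException => None
    }
}
```

With this, `MojibakeRepair.repair("Mich\u00C3\u00A8le Hu\u00C3\u00A0")` gives `Some(Michèle Huà)`, while the same string ending in an ordinary space (0x20) instead of 0xA0 gives `None`, which is the situation in the question.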

Upvotes: 9
