sandstrom

Reputation: 15102

Ruby and encoding conversion

I'm importing a CSV file into Ruby (1.8.7). File.open('path/to/file.csv').read returns this in the console:

Stefan,Engstr\232m

UniversalDetector (chardet gem) identifies the encoding as ISO-8859-2, though with a confidence of only about 0.63:

UniversalDetector::chardet("Stefan,Engstr\232m")
=> {"confidence"=>0.626936305574385, "encoding"=>"ISO-8859-2"} 

Trying to convert the string yields the following:

Iconv.conv("UTF-8", "ISO-8859-2", "Stefan,Engstr\232m")
 => "Stefan,Engstrm"

whereas I would expect:

 => "Stefan,Engström"

Let me know if I should provide more information or elaborate on something.

Upvotes: 3

Views: 1945

Answers (1)

mu is too short

Reputation: 434865

The encoding is probably "Macintosh Roman"; a couple of other options would be "Mac Central European" and "Mac Icelandic". The \nnn notation uses octal, so \232 is 154 in decimal (0x9A), and byte 154 is the lowercase o-umlaut ("ö") that you're expecting in all three of those encodings. I don't see 154 mapped to "ö" in any of the Windows code pages or ISO 8859 character sets; in ISO 8859 that byte falls in the C1 control range (128-159), which holds no printable characters. I'd guess that Mac Roman is more common than the Icelandic or Central European encodings.
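To check the byte arithmetic yourself, here is a minimal sketch that decodes byte 154 under a few candidate encodings. It assumes Ruby 1.9 or later (the question's 1.8.7 has no String#encode); the helper name decode_byte and the list of candidates are made up for illustration and are not exhaustive.

```ruby
# Sketch for Ruby 1.9+; decode_byte is a hypothetical helper name.
byte = 0232               # a leading 0 makes this an octal literal, i.e. 154
raw  = [byte].pack("C")   # a one-byte binary string

# Reinterpret the raw byte in the given encoding, then transcode to UTF-8.
# Unmappable bytes are replaced rather than raising an exception.
def decode_byte(raw, encoding)
  raw.dup.force_encoding(encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

# Illustrative candidates only; add whichever encodings you suspect.
candidates = ["macRoman", "Windows-1252", "ISO-8859-2"]

candidates.each do |name|
  char = decode_byte(raw, name)
  puts format("%-12s U+%04X %s", name, char.ord, char.inspect)
end
```

Only the Mac Roman row should come out as U+00F6, which is "ö".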

Try using 'MacRoman' as your source encoding with Iconv:

>> Iconv.conv("UTF-8", "MacRoman", "Stefan,Engstr\232m")
=> "Stefan,Engström"
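If you are on Ruby 1.9 or later, where Iconv is deprecated (and gone from the standard library as of 2.0), the same conversion can be sketched with the built-in String methods. This is an assumption about your setup rather than something the 1.8.7 console above can run, and mac_roman_to_utf8 is a hypothetical helper name.

```ruby
# Sketch for Ruby 2.0+ (String#b); mac_roman_to_utf8 is a hypothetical helper.
# It relabels the raw bytes as Mac Roman, then transcodes them to UTF-8.
def mac_roman_to_utf8(raw)
  raw.dup.force_encoding("macRoman").encode("UTF-8")
end

# .b gives a binary copy, so \232 stays a single raw byte.
raw = "Stefan,Engstr\232m".b
converted = mac_roman_to_utf8(raw)

puts converted
puts converted.encoding
```

When reading straight from the file, File.read also accepts an encoding option such as "macRoman:UTF-8", which should do the transcoding while reading.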

Upvotes: 5
