Benedikt B

Reputation: 753

Converting UTF-8 characters into proper ASCII characters

I have the string "V\355ctor" (I think that's Víctor). Is there a way to convert it to ASCII where í would be replaced by an ASCII i?

I have already tried Iconv without success. (All I get is Iconv::IllegalSequence: "\355ctor")

Further, are there differences between Ruby 1.8.7 and Ruby 2.0?

EDIT: Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "V\355ctor") seems to work, but the result is Vctor, not Victor.

Upvotes: 3

Views: 1773

Answers (2)

Mark Thomas

Reputation: 37517

I know of two options.

  1. transliterate from the I18n gem.

    $ irb
    1.9.3-p448 :001 > string = "Víctor"
     => "Víctor" 
    1.9.3-p448 :002 > require 'i18n'
     => true 
    1.9.3-p448 :003 > I18n.transliterate(string)
     => "Victor"
    
  2. Unidecoder from the stringex gem.

    require 'stringex'
    Stringex::Unidecoder.decode(string)
    

Update:

When running Unidecoder on "V\355ctor", you get the following error:

Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with IBM437 string)

So the string was tagged as IBM437 rather than UTF-8. You may want to convert it from IBM437 first:

    string.force_encoding('IBM437').encode('UTF-8')

This may help you get further. Note that the autodetected encoding could be incorrect. If you know exactly what the encoding is, everything becomes a lot easier.
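
To illustrate the "know your encoding" point: the sketch below assumes the byte \355 (0xED) is ISO-8859-1, where it stands for í. That is a guess the question does not confirm; under IBM437 the same byte decodes to something other than í. The helper name latin1_to_ascii is hypothetical. It uses only the standard library (String#unicode_normalize, Ruby 2.2 or later) in place of the gems above: it decomposes accented letters and drops the combining marks.

```ruby
# Hypothetical helper, not part of any library.
# Assumption: the input bytes are ISO-8859-1 (Latin-1).
def latin1_to_ascii(bytes)
  # Reinterpret the raw bytes as Latin-1, then transcode to UTF-8.
  utf8 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  # Decompose "í" into "i" + combining acute accent, then strip combining marks.
  utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts latin1_to_ascii("V\355ctor")
```

Characters that have no decomposition (ß or ø, for example) pass through unchanged, so this is not a full replacement for I18n.transliterate or Unidecoder.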

Upvotes: 8

Denis de Bernardy

Reputation: 78423

What you want to do is called transliteration.

The most used and best maintained library for this is ICU. (Iconv is frequently used too, but it has many limitations such as the one you ran into.)

A cursory Google search yields a few Ruby ICU wrappers. I'm afraid I cannot comment on which one is better, since I've never used any of them. But that is the kind of tool you want to be using.
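
As for the Iconv error in the question, it can be reproduced with built-in String methods alone, which matters on Ruby 2.0 and later, where Iconv is no longer part of the standard library. The sketch below assumes a UTF-8 source file and Ruby 2.1 or later (for String#scrub). It shows that "V\355ctor" is not well-formed UTF-8 in the first place, which is why dropping the offending byte gives Vctor rather than Victor.

```ruby
raw = "V\355ctor"          # byte 0xED followed by "ctor", tagged as UTF-8

puts raw.valid_encoding?   # is this well-formed UTF-8?
puts raw.bytes.inspect     # the raw byte values
puts raw.scrub('')         # drop invalid bytes, similar in effect to //IGNORE
```

The lone byte 0xED would start a three-byte UTF-8 sequence, but no continuation bytes follow it, so any strict converter has to reject it or discard it.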

Upvotes: 4
