Adrian Grigore

Reputation: 33318

How can I convert non-ASCII characters encoded in UTF-8 to their ASCII equivalents in Perl?

I have a Perl script that is called by third parties to send me the names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script to decode the UTF-8 input with Encode::decode_utf8(...).

This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I then have to follow up with the customer and ask for a "Latin character version" of the name in order to issue a registration code.

So, is there any Perl module that can detect whether there are such characters and automatically translate them to their closest ASCII representation if necessary?

It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

Upvotes: 5

Views: 10546

Answers (4)

Larry McPhillips

Reputation: 31

In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:

Make sure that the input data really is a utf8 string.

UTF-8 is a variable-length byte encoding, whereas Text::Unidecode works on Perl character strings, where each element is one already-decoded Unicode code point rather than one byte of UTF-8. So that sentence should read:

Make sure that the input data really is a decoded Unicode character string.

Raw UTF-8 bytes have to be decoded first.

If you want to convert strings which really are utf8, you would do it like so:

use Text::Unidecode;
my $decode_status = utf8::decode($input_to_be_converted);  # decodes in place; false if the input was not valid UTF-8
my $converted_string = unidecode($input_to_be_converted);
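
Since the decode step can fail on malformed input, it may be worth checking it explicitly. The following is only a sketch using the core Encode module; the helper name decode_utf8_strict is hypothetical and not part of any library:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded character string,
# or undef if $bytes is not well-formed UTF-8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars;
}
```

A caller could fall back to asking the customer for a Latin spelling only when this returns undef.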

Upvotes: 1

Nemanja Trifunovic

Reputation: 24551

If you get Cyrillic text, there is no "closest ASCII representation" for many characters.

Upvotes: 0

mirod

Reputation: 16136

I believe you could use Text::Unidecode for this; that is precisely what it tries to do.
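
A minimal sketch of that approach, assuming the name arrives as raw UTF-8 bytes and that Text::Unidecode is installed from CPAN. The helper name to_ascii is hypothetical and only there for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Hypothetical helper: turn raw UTF-8 bytes into an ASCII-only string.
sub to_ascii {
    my ($bytes) = @_;
    # Decode the bytes into a Perl character string first.
    my $chars = decode('UTF-8', $bytes);
    # Strings that are already pure ASCII pass through untouched.
    return $chars unless $chars =~ /[^\x00-\x7F]/;
    # Otherwise let Text::Unidecode pick an approximate ASCII spelling.
    return unidecode($chars);
}

# Example call with the UTF-8 bytes for three Cyrillic letters.
print to_ascii("\xD0\x9F\xD0\xBE\xD0\xB4"), "\n";
```

The result is an approximation meant for human readers, so it is probably best to store the original UTF-8 name alongside the transliterated one.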

Upvotes: 10

innaM

Reputation: 47829

If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji characters?

Upvotes: 0
