Gaurav Pant

Reputation: 4199

convert a string with unicode characters to lower case

Problem statement: I am processing some data files. The data dump contains strings in which some characters are written as escaped Unicode values. Characters may be in upper or lower case. I need to process each string as follows.

1- If the string contains any of - , _ ) ( } { ] [ ' " then delete them. All of these characters appear in the string in escaped form ($ followed by four hex digits).

2- All upper case characters need to be converted to lower case, including non-ASCII characters ('Φ' -> 'φ', 'Ω' -> 'ω', 'Ž' -> 'ž').

3- Later I will use the final string for matching against different user inputs.

Detailed description: I have strings like Buna$002C_Texas, Zamboanga_$0028province$0029, and many more.

Here $002C, $0028 and $0029 are escaped Unicode code points, and I convert them to the corresponding characters using one of the following:

$str =~s/\$(....)/chr(hex($1))/eg;

OR

$str =~s/\$(....)/pack 'U4', $1/eg;

Next I substitute the characters as required. Then I decode the string from UTF-8 so that lc can lowercase all characters, including non-ASCII ones, because lc applied directly to the string does not handle them:

$str =~ s/(^\-|\-$|^\_|\_$)//g;
$str =~ s/[\-\_,]/ /g;
$str =~ s/[\(\)\"\'\.]|ʻ|’|‘//g;
$str =~ s/^\s+|\s+$//g;
$str =~ s/\s+/ /g;
$str = decode('utf-8',$str);
$str = lc($str);
$str = encode('utf-8',$str);

But I get the following error when Perl tries to decode the string:

Cannot decode string with wide characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173

This error is expected, as described here: http://www.perlmonks.org/?node_id=569402

So I changed my logic as suggested at that URL, and now use the following to convert the escapes to characters:

$str =~s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;

But now I do not get the expected characters; I get non-printable characters instead. How should I handle this when I do not know in advance which Unicode characters will appear?

Upvotes: 0

Views: 1980

Answers (1)

amon

Reputation: 57600

You want to decode the string before you do your transformations, preferably by using a PerlIO layer like :utf8. Because you interpolate the escaped code points before decoding, your string may already contain characters above 255 (wide characters) by the time you call decode, which expects a string of bytes. Remember, Perl (seemingly) operates on code points, not bytes.
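
To see why the order matters, here is a minimal sketch that first unescapes and then decodes (which fails), and then decodes before unescaping (which works). The sample bytes and variable names are made up for illustration, and it calls Encode::decode directly instead of using a PerlIO layer only so that it runs without an input file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw UTF-8 bytes, as read from a file without any layer.
# "\xC5\xBD" is the UTF-8 encoding of Ž (U+017D).
my $bytes = "\xC5\xBDupa\$002C_\$03A9";

# Wrong order: unescaping first puts chr(0x3A9), a character above 255,
# into what was a byte string, so decode can no longer treat it as octets.
(my $wrong = $bytes) =~ s/\$([0-9A-Fa-f]{4})/chr hex $1/eg;
my $decoded_ok = eval { decode('UTF-8', $wrong); 1 };

# Right order: decode the bytes to characters, then unescape.
my $right = decode('UTF-8', $bytes);
$right =~ s/\$([0-9A-Fa-f]{4})/chr hex $1/eg;

printf "unescape-then-decode failed: %s\n", $decoded_ok ? 'no' : 'yes';
printf "decode-then-unescape length: %d code points\n", length $right;
```

With a PerlIO layer on the file handle the decode step happens on read, so only the unescape step remains in your own code.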

So what we'll do is the following: decode, unescape, normalize, remove, case fold:

 use strict; use warnings;
 use utf8;  # This source file holds Unicode chars, should be properly encoded
 use feature 'unicode_strings'; # we want Unicode semantics everywhere
 use Unicode::CaseFold; # or: use feature 'fc'
 use Unicode::Normalize;

 # implicit decode via PerlIO-layer
 open my $fh, "<:utf8", $file or die ...;
 while (<$fh>) {
   chomp;

   # interpolate the escaped code points
   s/\$(\p{AHex}{4})/chr hex $1/eg;

   # normalize the representation
   $_ = NFD $_;  # or NFC or whatever you like

   # remove unwanted characters. prefer transliterations where possible,
   # as they are more efficient:
   tr/.ʻ//d;
   s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;  # I suppose you want to remove *all* quotation marks?
   tr/-_,/   /;
   s/\A\s+//;
   s/\s+\z//;
   s/\s+/ /g;

   # finally normalize case
   $_ = fc $_;

   # store $_ somewhere.
 }

You may be interested in perluniprops, a list of all available Unicode character properties, like Quotation_Mark, Punct (punctuation), Dash (dashes like - – —), Open_Punctuation (parens like ({[〈 and quotation marks like „“) etc.
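
As a rough check, the same steps can be applied to the two sample strings from the question. This sketch wraps them in a helper called normalize_key (a hypothetical name), uses NFC rather than NFD, and relies on the built-in fc, which needs Perl 5.16 or later. It assumes the input is already a decoded character string.

```perl
use v5.16;            # enables strict, say, fc and unicode_strings
use warnings;
use Unicode::Normalize qw(NFC);

# normalize_key is a hypothetical helper name used only for this sketch.
# It expects an already decoded (character) string.
sub normalize_key {
    my ($s) = @_;
    $s =~ s/\$(\p{AHex}{4})/chr hex $1/eg;   # unescape $XXXX code points
    $s = NFC($s);                            # normalize the representation
    $s =~ tr/.\x{2BB}//d;                    # drop "." and U+02BB (the okina)
    $s =~ s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;
    $s =~ tr/-_,/   /;                       # separators become spaces
    $s =~ s/\A\s+//;                         # trim leading whitespace
    $s =~ s/\s+\z//;                         # trim trailing whitespace
    $s =~ s/\s+/ /g;                         # collapse runs of whitespace
    return fc $s;                            # case fold last
}

say normalize_key('Buna$002C_Texas');                # buna texas
say normalize_key('Zamboanga_$0028province$0029');   # zamboanga province
```

User input would go through the same helper before comparison, so both sides of the match are normalized identically.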

Why do we perform Unicode normalization? Some graphemes (visual characters) can have multiple distinct representations. E.g. á can be represented as “a with acute” or as “a” + “combining acute”. NFC tries to combine the information into one code point, whereas NFD decomposes it into multiple code points. Note that these operations can change the length of the string, as the length is measured in code points.
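
A small sketch of that effect, using the á example (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";     # á as a single code point (U+00E1)
my $decomposed = "a\x{301}";   # a followed by combining acute (U+0301)

# The two strings display the same but do not compare equal as-is;
# after normalizing both to the same form they do.
my $equal_raw = $composed eq $decomposed           ? 1 : 0;
my $equal_nfc = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Normalization changes the length, which is counted in code points.
my $len_nfc = length(NFC($decomposed));
my $len_nfd = length(NFD($composed));

printf "equal before: %d, equal after NFC: %d\n", $equal_raw, $equal_nfc;
printf "length as NFC: %d, length as NFD: %d\n", $len_nfc, $len_nfd;
```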

Before outputting data which you decomposed, it might be good to recompose it again.

Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent, but wouldn't compare the same, e.g. the Greek lowercase sigma: σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this, and transforms the ß to ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data).
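
The two cases above can be checked with a short sketch. It assumes Perl 5.16 or later for the built-in fc (on older Perls, Unicode::CaseFold exports an fc as well); the variable names are illustrative only.

```perl
use v5.16;     # enables the built-in fc and Unicode string semantics
use warnings;

my $sigma       = "\x{3C3}";   # σ, Greek small letter sigma
my $final_sigma = "\x{3C2}";   # ς, Greek small letter final sigma
my $eszett      = "\x{DF}";    # ß, Latin small letter sharp s

# Both sigmas are already lowercase, so lc leaves them different;
# fc maps them to the same folded form.
my $sigma_lc_equal = lc($sigma) eq lc($final_sigma) ? 1 : 0;
my $sigma_fc_equal = fc($sigma) eq fc($final_sigma) ? 1 : 0;

# uc turns ß into "SS", and lc of that is "ss", not ß.
my $eszett_roundtrip = lc(uc($eszett)) eq $eszett     ? 1 : 0;
my $eszett_fc_equal  = fc($eszett) eq fc(uc($eszett)) ? 1 : 0;

printf "sigma:  lc equal %d, fc equal %d\n", $sigma_lc_equal, $sigma_fc_equal;
printf "eszett: lc(uc) round-trips %d, fc equal %d\n",
    $eszett_roundtrip, $eszett_fc_equal;
```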

Upvotes: 5
