Reputation: 4199
Problem Statement - I am processing some data files. In that data dump I have strings that contain Unicode code points written as escapes. The characters may be upper or lower case. I need to process each string as follows:
1- If the string contains any of - , _ ) ( } { ] [ ' " then delete them. All of these characters appear in the string in escaped form, as $ followed by 4 hexadecimal digits.
2- Convert all upper case characters to lower case, including non-ASCII characters ('Φ' -> 'φ', 'Ω' -> 'ω', 'Ž' -> 'ž').
3- Later I will use the final string for matching against different user inputs.
Problem detail description-- I have strings like Buna$002C_Texas, Zamboanga_$0028province$0029 and many more. Here $002C, $0028 and $0029 are Unicode code points, and I convert them to their character representation using one of the following:
$str =~s/\$(....)/chr(hex($1))/eg;
OR
$str =~s/\$(....)/pack 'U4', $1/eg;
Next I substitute characters as required. Then I decode the string as UTF-8 so that lc lowercases all characters, including non-ASCII ones, because lc applied directly to the string does not handle them:
$str =~ s/(^\-|\-$|^\_|\_$)//g;
$str =~ s/[\-\_,]/ /g;
$str =~ s/[\(\)\"\'\.]|ʻ|’|‘//g;
$str =~ s/^\s+|\s+$//g;
$str =~ s/\s+/ /g;
$str = decode('utf-8',$str);
$str = lc($str);
$str = encode('utf-8',$str);
But I get the error below when Perl tries to decode the string.
Cannot decode string with wide characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173
This error is expected, as described at http://www.perlmonks.org/?node_id=569402
I then changed my logic as per the above URL and used the following to convert the escapes to characters:
$str =~s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;
But now I do not get the character representation; I get non-printable characters instead. How should I deal with this problem when I do not know how many different Unicode characters will occur?
Upvotes: 0
Views: 1980
Reputation: 57600
You want to decode the string before you do your transformations, preferably by using a PerlIO layer like :utf8. Because you interpolate the escaped code points before decoding, your string may already contain characters above 255 ("wide characters"), which decode cannot accept because it expects bytes. Remember, Perl (seemingly) operates on code points, not bytes.
So what we'll do is the following: decode, unescape, normalize, remove, case fold:
use strict; use warnings;
use utf8; # this source file contains literal Unicode characters and must be saved as UTF-8
use feature 'unicode_strings'; # we want Unicode semantics everywhere
use Unicode::CaseFold; # provides fc; on Perl 5.16+ you can instead: use feature 'fc'
use Unicode::Normalize;
# implicit decode via PerlIO layer
# ($file is the path to your data file, defined elsewhere)
open my $fh, "<:utf8", $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    # interpolate the escaped code points
    s/\$(\p{AHex}{4})/chr hex $1/eg;
    # normalize the representation
    $_ = NFD $_; # or NFC or whatever you like
    # remove unwanted characters. prefer transliterations where possible,
    # as they are more efficient:
    tr/.ʻ//d;
    s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g; # I suppose you want to remove *all* quotation marks?
    tr/-_,/ /;
    s/\A\s+//;
    s/\s+\z//;
    s/\s+/ /g;
    # finally normalize case
    $_ = fc $_;
    # store $_ somewhere.
}
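For strings that do not come from a file handle, the same order of steps (decode, unescape, normalize, remove, case fold) can be sketched as a function. This is only a minimal sketch under a few assumptions: the name normalize_key is hypothetical, the input is assumed to be UTF-8 encoded bytes, the built-in fc requires Perl 5.16 or later, and NFC is an arbitrary choice of normalization form.

```perl
use strict;
use warnings;
use feature qw(unicode_strings fc);   # fc as a built-in needs Perl 5.16+
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: turn one raw record (UTF-8 bytes) into a match key.
sub normalize_key {
    my ($bytes) = @_;
    my $s = decode('UTF-8', $bytes);         # 1. decode bytes to characters first
    $s =~ s/\$(\p{AHex}{4})/chr hex $1/eg;   # 2. unescape $XXXX code points
    $s = NFC($s);                            # 3. normalize (NFC chosen arbitrarily)
    $s =~ tr/.\x{2BB}//d;                    # 4. drop "." and U+02BB (the okina)
    $s =~ s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;
    $s =~ tr/-_,/ /;                         #    separators become spaces
    $s =~ s/\A\s+|\s+\z//g;                  #    trim both ends
    $s =~ s/\s+/ /g;                         #    collapse runs of whitespace
    return fc $s;                            # 5. case fold
}

# Sample inputs from the question; single quotes keep the $ literal.
print normalize_key('Buna$002C_Texas'), "\n";               # buna texas
print normalize_key('Zamboanga_$0028province$0029'), "\n";  # zamboanga province
```

The returned key is a character string, so encode it again (or write it through an encoding layer) if it contains non-ASCII characters and has to be stored or printed.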
You may be interested in perluniprops, a list of all available Unicode character properties, like Quotation_Mark, Punct (punctuation), Dash (dashes like - – —), Open_Punctuation (opening brackets like ( { [ 〈 and low opening quotation marks like „), etc.
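As a small illustration of how these properties behave in a regex, the sketch below strips quotation marks and brackets from a sample string and counts dashes in another. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# U+201E and U+201C are the German-style quotation marks.
my $s = "\x{201E}Texas\x{201C} (USA) [x]";

# Remove every quotation mark and every opening/closing bracket.
(my $stripped = $s) =~ s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;
# $stripped is now "Texas USA x"

# Count the dashes: hyphen-minus, en dash (U+2013), em dash (U+2014).
my $dashes = () = "a-b\x{2013}c\x{2014}d" =~ /\p{Dash}/g;
# $dashes is now 3

print "$stripped\n$dashes\n";
```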
Why do we perform Unicode normalization? Some graphemes (visual characters) can have multiple distinct representations. E.g. á can be represented as “a with acute” or as “a” + “combining acute”. NFC composes such sequences into a single code point where one exists, whereas NFD decomposes them into multiple code points. Note that these operations can change the length of the string, as the length is measured in code points.
Before outputting data which you decomposed, it might be good to recompose it again.
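The effect of the two forms on á can be checked with a short sketch like the following; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";     # á as a single code point
my $decomposed = "a\x{301}";   # "a" followed by COMBINING ACUTE ACCENT

# The two strings look the same when rendered but differ as code point sequences.
my $equal_raw = ($composed eq $decomposed) ? 1 : 0;        # 0
my $equal_nfc = (NFC($decomposed) eq $composed) ? 1 : 0;   # 1
my $equal_nfd = (NFD($composed) eq $decomposed) ? 1 : 0;   # 1

# Length is counted in code points, so normalization can change it.
my $len_nfc = length NFC($decomposed);   # 1
my $len_nfd = length NFD($composed);     # 2

print "$equal_raw $equal_nfc $equal_nfd $len_nfc $len_nfd\n";
```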
Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent but not compare equal, e.g. the two forms of the Greek lowercase sigma, σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this too, and transforms the ß to ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data.)
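The sigma and ß cases can be verified with a few comparisons. This sketch assumes Perl 5.16 or later for the built-in fc (Unicode::CaseFold provides the same function on older versions); the variable names are illustrative.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc as a built-in needs Perl 5.16+

my $sigma       = "\x{3C3}";   # GREEK SMALL LETTER SIGMA
my $final_sigma = "\x{3C2}";   # GREEK SMALL LETTER FINAL SIGMA
my $sharp_s     = "\x{DF}";    # LATIN SMALL LETTER SHARP S

# Both sigmas are already lowercase, so lc leaves them distinct.
my $sigma_lc_equal = (lc($final_sigma) eq lc($sigma)) ? 1 : 0;   # 0
# Case folding maps the final sigma to the ordinary one.
my $sigma_fc_equal = (fc($final_sigma) eq fc($sigma)) ? 1 : 0;   # 1

# uc turns the sharp s into "SS", and lc of that is "ss", not the sharp s.
my $roundtrip = lc uc $sharp_s;                                  # "ss"
# Case folding gives "ss" for both spellings.
my $sharp_fc_equal = (fc($sharp_s) eq fc(uc $sharp_s)) ? 1 : 0;  # 1

print "$sigma_lc_equal $sigma_fc_equal $roundtrip $sharp_fc_equal\n";
```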
Upvotes: 5