Reputation: 1715
This doesn't work when using Unicode characters (in Ubuntu bash):
$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a
Even though it seems to be supported by PCRE (at least according to regex101).
What am I doing wrong? Am I missing some flag in the perl command?
This "just works" in JavaScript, so I would use node if I could come up with a simple one-liner for the command line ... but I still want to know why the perl command is not working.
For context:
I'm trying to use substitutions like s/[àâáãä]/a/g, s/[òôóõö]/o/g, etc. to asciify a dictionary file (i.e. remove accents and the like from a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ IDEA).
Basically these are the steps to make an "asciified" extra dictionary:
Upvotes: 2
Views: 926
Reputation: 66873
One practical approach for all of it is to use Text::Unidecode
perl -C -MText::Unidecode -pe'unidecode($_)' <<< 'à'
Prints a. The module transliterates Unicode text into plain ASCII.
Another approach: decompose the characters ("normalize" them) using Unicode::Normalize, so that each base character and its diacritical marks (combining accents) become separate code points while still forming a valid grapheme, then remove the diacriticals (\p{NonspacingMark} or \p{Mn}) with a simple regex.
Both of these approaches have exceptions and edge cases, but I think either may do just what you need.
As for code that contains specific (literal) characters, you need to tell Perl that the program source is UTF-8, via the utf8 pragma with use utf8; or with the command-line flag -Mutf8:
perl -C -Mutf8 -pe's/[à]/a/g' <<< 'à'
Upvotes: 8
Reputation: 1832
The short answer is to add -Mutf8 to your command line.
If you're not sure how Perl is interpreting what you wrote on the command line, you can make it spit it back to you with the core B::perlstring() function, or deparse the whole script with B::Deparse. That would illustrate your problem real fast. (Enclosing the 'à' character in brackets doesn't do anything here.)
$ perl -MO=Deparse -pC -e 's/à/a/gu' <<< 'à'
LINE: while (defined($_ = <ARGV>)) {
s/\303\240/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
See how your substitution strangely has two characters in it (\303 and \240)? Those are the two bytes of the UTF-8 encoding of à.
You can then see immediately how use utf8 fixes your problem.
$ perl -MO=Deparse -Mutf8 -pC -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
s/\340/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
You can use perlstring() to make sure Perl is receiving the input you think it is.
$ perl -p -MB -E 'say B::perlstring($_)' <<< 'à'
"\303\240\n"
à
$ perl -pC -MB -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à
You can see that without -C Perl is receiving two separate bytes (the UTF-8 encoding of à) instead of one decoded character.
Depending on the circumstances, Perl dumps characters as either an octal code (\340) or a hexadecimal code (\xE0). Note well that you can always replace raw Unicode characters in your command line with the escape code version. This is a great way to make explicit what would otherwise be ambiguous.
$ perl -pC -e 's/[\xE0]/a/gu' <<< 'à'
a
If you don't want to have to remember UTF-8 mode, you can shove those options in the PERL5OPT environment variable or create a shell alias. Beware of making this global!
$ export PERL5OPT='-C -Mutf8'
$ perl -MO=Deparse -p -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
s/\340/a/gu;
}
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
$ perl -MB -p -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à
Or as a shell alias.
alias uperl='perl -C -Mutf8'
See perlrun for more information on how to Swiss Army Chainsaw the command line.
See also B::Deparse.
Upvotes: 2
Reputation: 1715
Here's how I implemented steps 2 and 3.
This can be used, e.g., in these dictionaries (though I didn't test it on every language).
asciify-dic
#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
if [[ "$1" == "--help" ]]; then
echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
echo "Asciify a .dic file (list of dictionary words)."
echo ""
echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
echo "These additional words can be used to make spell-checking accent-insensitive."
echo "Comment lines beginning with % are left unchanged."
exit
fi
# Filter words containing non-ascii characters, except in comments
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Make words accent-insensitive, except in comments
perl -C -MText::Unidecode -pe'next if /^\s*%/;unidecode($_)' |
# Remove duplicate lines, except in comments
awk '/^\s*%/||!seen[$0]++'
Example usage:
asciify-dic $DIC_NAME.dic > $DIC_NAME-asciified.dic
Upvotes: 1
Reputation: 780
You need to add -Mutf8 to tell Perl that the program source is encoded in UTF-8; otherwise it treats each byte of the source as a separate character.
$ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a
Upvotes: 4