geekley

Reputation: 1715

How can I use Unicode characters in a Perl regex substitution command?

This doesn't work when using unicode characters (in Ubuntu bash):

$ perl -pC -e's/[à]/a/gu' <<< 'à'
à
$ perl -pC -e's/[b]/a/gu' <<< 'b'
a

Even though it seems to be supported by PCRE (at least according to regex101).

What am I doing wrong? Am I missing some flag in the perl command?

This "just works" in JavaScript, so I would use node if I could come up with a simple one-liner for the command line. Still, I want to know why the perl command is not working.


For context:

I'm trying to use substitutions like /[àâáãä]/a/g, /[òôóõö]/o/g, etc to asciify a dictionary file (i.e. remove accents, etc. of a word list), so I can use it to make spell-checking accent-insensitive (e.g. in IntelliJ Idea).

Basically these are the steps to make an "asciified" extra dictionary:

  1. Download the .dic file for the language (list of all words)
  2. Use grep to filter words containing non-ascii / replaceable characters
  3. Use regex substitutions in succession to make words accent-insensitive
  4. Import the asciified .dic file in the IDE (in addition to the standard language dictionary)

Upvotes: 2

Views: 926

Answers (4)

zdim

Reputation: 66873

One practical approach for all of it is to use Text::Unidecode

perl -C -MText::Unidecode -pe'unidecode($_)'  <<< 'à'

Prints a. The module transliterates Unicode text into plain ASCII.

Another approach: decompose ("normalize") the characters using Unicode::Normalize, so that each base character and its diacritical marks (combining accents) become separate code points while still forming a valid grapheme. Then remove the diacritics (\p{NonspacingMark} or \p{Mn}) with a simple regex.

Both approaches have exceptions and edge cases, but I think either may do just what you need.


As for code containing specific (literal) characters: you need to tell Perl that the program source is UTF-8, either with the utf8 pragma (use utf8;) or with the command-line flag -Mutf8

perl -C -Mutf8 -pe's/[à]/a/g' <<< 'à'

Upvotes: 8

lordadmira

Reputation: 1832

The short answer is to add -Mutf8 to your command line.

If you're not sure how Perl is interpreting what you wrote on the command line you can make it spit it back to you with the core B::perlstring() function or deparse the whole script with B::Deparse. That would illustrate your problem real fast. (Enclosing the 'à' character in brackets doesn't do anything here.)

$ perl -MO=Deparse -pC -e 's/à/a/gu' <<< 'à'

LINE: while (defined($_ = <ARGV>)) {
    s/\303\240/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

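See how your substitution strangely has 2 characters in it? Those are the two bytes of the UTF-8 encoding of à, which Perl treats as separate characters when use utf8 is not in effect.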

You can then see immediately how use utf8 fixes your problem.

$ perl -MO=Deparse -Mutf8 -pC -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

You can use perlstring() to make sure Perl is receiving the input you think it is.

$ perl -p -MB -E 'say B::perlstring($_)' <<< 'à'
"\303\240\n"
à
$ perl -pC -MB -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

You can see that without -C Perl is receiving 2 separate bytes (the UTF-8 encoding of à) rather than one decoded character.

Depending on the circumstances, Perl dumps characters as either an octal code (\340) or a hexadecimal code (\xE0). Note well that you can always replace raw Unicode characters in your command line with the escape-code version. This is a great way to make explicit what would otherwise be ambiguous.

$ perl -pC -e 's/[\xE0]/a/gu' <<< 'à'
a

If you don't want to have to remember UTF8 mode, you can shove those options in the PERL5OPT environment variable or create a shell alias. Beware of making this global!

$ export PERL5OPT='-C -Mutf8'
$ perl -MO=Deparse -p -e 's/à/a/gu' <<< 'à'
use utf8;
LINE: while (defined($_ = <ARGV>)) {
    s/\340/a/gu;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

$ perl -MB -p -E 'say B::perlstring($_)' <<< 'à'
"\x{e0}\n"
à

Or as a shell alias.

alias uperl='perl -C -Mutf8'

See perlrun for more information on how to Swiss Army Chainsaw the command line.

See also B::Deparse.

Upvotes: 2

geekley

Reputation: 1715

Here's how I implemented steps 2 and 3.
This can be used, e.g., in these dictionaries (though I didn't test it on every language).

asciify-dic

#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
if [[ "$1" == "--help" ]]; then
  echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
  echo "Asciify a .dic file (list of dictionary words)."
  echo ""
  echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
  echo "These additional words can be used to make spell-checking accent-insensitive."
  echo "Comment lines beginning with % are left unchanged."
  exit
fi
# Filter words containing non-ascii characters, except in comments
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Make words accent-insensitive, except in comments
perl -C -MText::Unidecode -pe'next if /^\s*%/;unidecode($_)' |
# Remove duplicate lines, except in comments
awk '/^\s*%/||!seen[$0]++'

Example usage:

asciify-dic "$DIC_NAME.dic" > "$DIC_NAME-asciified.dic"

Upvotes: 1

BarneySchmale

Reputation: 780

You need to add -Mutf8 to tell Perl that the program source is encoded in UTF-8. Without it, Perl treats each byte of the source as a separate character.

$ perl -pC -Mutf8 -e's/[à]/a/gu' <<< 'à'
a

Upvotes: 4
