I am writing a simple Perl script to check a string for different word forms (in English and Russian) of a nickname. I would like to use the following regex: /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i, which is valid according to the regex101.com test and Notepad++. However, on my computer this regex doesn't work in Perl unless I add extra parentheses around the operands of ? and |: /((gun(n)?er)|(gun(n)?)|(ган(н)?ер(у)?)|(ган(н)?(у)?))/i. A friend I asked about this couldn't reproduce the behavior. Is there some setting of the script, or of the Perl interpreter itself, that I should change?
Edit: As requested, the code of my tests:
#!/usr/bin/perl
my $GUN = "gunner";
my $HZ = "!!!";
sub GetNickFromMsg
{
    my ($msg) = @_;
    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i )
    {
        return $GUN;
    }
    return $HZ;
}
my @nicks = ("Gunner", "guner", "ганнер", "ганеру", "гану");
foreach $n (@nicks)
{
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}
The output I get:
Gunner -> !!!
guner -> !!!
ганнер -> !!!
ганеру -> !!!
гану -> !!!
If I change the regex to the second version, with parentheses everywhere, the output for every word form is "-> gunner", as it should be. I've tried adding use feature 'unicode_strings' to the beginning of the script and using the ui modifier instead of i, as Casimir suggested, but it didn't help.
I run the script on a Linux server, Linux version 4.3.0-1-amd64 (gcc version 5.3.1 20160101 (Debian 5.3.1-5)) #1 SMP Debian 4.3.3-5 (2016-01-04), with Perl version 5.22.1.
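You need to add use utf8 at the top of your program to tell Perl that the source code itself contains UTF-8-encoded characters. Without it, Perl reads the source as bytes, so each Cyrillic letter in the pattern is a two-byte sequence and a quantifier such as ? applies only to its second byte.
You will also need to set STDOUT to handle UTF-8 encoding; otherwise you will get Wide character in print warnings.
Here's an edited version of your program that works correctly and gives the behaviour you expected: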
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(UTF-8) /;

my $GUN = 'gunner';
my $HZ  = '!!!';

sub GetNickFromMsg {
    my ($msg) = @_;
    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i ) {
        return $GUN;
    }
    return $HZ;
}

my @nicks = qw/ Gunner guner ганнер ганеру гану /;
foreach my $n (@nicks) {
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}
Output:
Gunner -> gunner
guner -> gunner
ганнер -> gunner
ганеру -> gunner
гану -> gunner
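To see why use utf8 changes how a quantifier behaves, here is a minimal sketch, not part of the original program. It assumes the core Encode module and uses made-up variable names ($word, $pattern and so on); encoding the strings to raw UTF-8 bytes only imitates what Perl sees when the source is read without use utf8. The script prints ASCII only, so it needs no output layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # literals below are decoded into characters
use Encode qw(encode);    # core module, used here to get the raw bytes

my $word    = 'ганер';      # spelled with a single "н"
my $pattern = 'ганн?ер';    # the second "н" is meant to be optional

# Character semantics: "?" applies to the whole letter "н".
my $char_match = $word =~ /\A$pattern\z/ ? 'match' : 'no match';

# Byte semantics (imitating a source file read without `use utf8`):
# every Cyrillic letter becomes two bytes, and "?" applies only to
# the second byte of "н", so the first byte is still required.
my $word_bytes    = encode('UTF-8', $word);
my $pattern_bytes = encode('UTF-8', $pattern);
my $byte_match = $word_bytes =~ /\A$pattern_bytes\z/ ? 'match' : 'no match';

printf "length in characters: %d\n", length $word;
printf "length in bytes:      %d\n", length $word_bytes;
print  "character pattern:    $char_match\n";
print  "byte pattern:         $byte_match\n";
```

The five-letter word is ten bytes long, and the same pattern matches it as characters but not as bytes. Wrapping each letter in its own group, as in the fully parenthesised regex from the question, makes the quantifier cover both bytes again, which is presumably why that version behaved differently.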