Artalus

Reputation: 1162

Perl: regex won't work without parentheses

I am writing a simple Perl script that checks a string for different word forms (in English and Russian) of a nickname. I want to use the following regex: /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i, which is valid according to the regex101.com tester and Notepad++. However, on my computer this regex doesn't work in Perl unless I wrap everything that ? and | apply to in additional parentheses: /((gun(n)?er)|(gun(n)?)|(ган(н)?ер(у)?)|(ган(н)?(у)?))/i. A friend whom I asked about this couldn't reproduce the behavior. Is there some setting of the script or of the Perl interpreter itself that I should change?

Edit: As requested, the code of my tests:

#!/usr/bin/perl
my $GUN = "gunner";
my $HZ = "!!!";

sub GetNickFromMsg
{
    my ($msg) = @_;
    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i )
    {
        return $GUN;
    }
    return $HZ;
}

my @nicks = ("Gunner", "guner", "ганнер", "ганеру", "гану");
foreach $n (@nicks)
{
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}

The output I get:

Gunner -> !!!
guner -> !!!
ганнер -> !!!
ганеру -> !!!
гану -> !!!

If I change the regex to the second version, with parentheses everywhere, the output for every word form is "-> gunner", as it should be. I've tried adding use feature 'unicode_strings' to the beginning of the script and using the ui modifiers instead of i, as Casimir suggested, but it didn't help.

I run the script on a Linux server with Perl version 5.22.1. The kernel reports: Linux version 4.3.0-1-amd64 ([email protected]) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP Debian 4.3.3-5 (2016-01-04)

Upvotes: 1

Views: 111

Answers (1)

Borodin

Reputation: 126722

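You need to add use utf8 at the top of your program to tell Perl that the program code itself contains UTF-8-encoded characters. Without it, and assuming the file is saved as UTF-8, Perl reads each Cyrillic letter in the regex as two separate bytes, so a ? quantifier applies only to the second byte of the letter rather than to the whole letter.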
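The effect can be reproduced in a small sketch. It simulates a source file without use utf8 by encoding the word and the pattern to UTF-8 bytes with the core Encode module; as_source_bytes is a hypothetical helper introduced only for this illustration, and the file itself is assumed to be saved as UTF-8. It covers only the Cyrillic alternatives, so treat it as one plausible reading of why the extra parentheses helped, not a full diagnosis.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # literals in this file are decoded as characters
use Encode qw(encode);

# Hypothetical helper: returns the UTF-8 bytes of a string, i.e. what Perl
# would see for the same literal in a source file without "use utf8".
sub as_source_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $word       = 'гану';
my $word_bytes = as_source_bytes($word);

my $plain   = as_source_bytes('ганн?у?');       # each ? binds to a single byte
my $grouped = as_source_bytes('ган(н)?(у)?');   # each ? binds to a whole letter

# Character-level match, as it behaves under "use utf8"
my $char_match    = $word       =~ /\Aганн?у?\z/      ? 'yes' : 'no';
# Byte-level matches, as they behave without "use utf8"
my $plain_match   = $word_bytes =~ /\A(?:$plain)\z/   ? 'yes' : 'no';
my $grouped_match = $word_bytes =~ /\A(?:$grouped)\z/ ? 'yes' : 'no';

printf "characters: %d, bytes: %d\n", length($word), length($word_bytes);
print  "character pattern matches: $char_match\n";
print  "byte pattern without groups matches: $plain_match\n";
print  "byte pattern with groups matches: $grouped_match\n";
```

In the byte-level pattern without groups, the first byte of the optional н is still mandatory, so the word гану cannot match; grouping the two bytes makes the whole letter optional again.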

You will also need to put a UTF-8 encoding layer on STDOUT; otherwise you will get "Wide character in print" warnings.
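As a minimal sketch of what such a layer does, the snippet below prints a decoded string through an :encoding(UTF-8) layer into an in-memory buffer, so the resulting bytes can be inspected, and then applies the same layer to STDOUT with binmode. The buffer and the variable names are assumptions made for illustration; binmode on a single handle is an alternative to a program-wide use open declaration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the literals below are decoded as characters

# Write to an in-memory buffer instead of the terminal so that the
# encoded bytes can be inspected afterwards.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out} "гану\n";
close $out or die "Cannot close in-memory handle: $!";

printf "bytes written: %d\n", length $buffer;

# The same layer applied to the real STDOUT; without it, the print
# below would emit a "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print "гану\n";
```

The buffer ends up holding nine bytes: two for each of the four Cyrillic letters plus the newline.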

Here's an edited version of your program that works correctly and produces the output you expected:

#!/usr/bin/perl

use utf8;
use strict;
use warnings 'all';

use open qw/ :std :encoding(UTF-8) /;

my $GUN = 'gunner';
my $HZ  = '!!!';

sub GetNickFromMsg {
    my ($msg) = @_;

    if ( $msg =~ /(gunn?er|gunn?|ганн?еру?|ганн?у?)/i ) {
        return $GUN;
    }

    return $HZ;
}

my @nicks = qw/ Gunner guner ганнер ганеру гану /;

foreach my $n (@nicks) {
    my $res = GetNickFromMsg($n);
    print "$n -> $res\n";
}

Output:

Gunner -> gunner
guner -> gunner
ганнер -> gunner
ганеру -> gunner
гану -> gunner

Upvotes: 4
