clt60

Reputation: 63902

How to combine multiple Unicode properties in perl regex?

I have this script:

use 5.014;
use warnings;

use utf8;    
binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

my @greek = ($str =~ /\p{Greek}/g);
say "Greek: @greek";

my @upper = ($str =~ /\p{Upper}/g);
say "Upper: @upper";

#my @upper_greek = ($str =~ /\p{Upper+Greek}/); #wrong.
#say "Upper+Greek: @upper_greek";

Is it possible to combine multiple Unicode properties? E.g., how can I select only the characters that are both Upper and Greek, and get the desired output:

Greek: Φ Ψ Ω φ ψ ω
Upper: X Y Z Φ Ψ Ω
Upper+Greek: Φ Ψ Ω      #<-- how to get this?

Upvotes: 9

Views: 313

Answers (2)

ikegami

Reputation: 385754

We want to perform an AND operation, so we can't use

/(?:\p{Greek}|\p{Upper})/         # Greek OR Upper

or

/[\p{Greek}\p{Upper}]/            # Greek OR Upper

Since 5.18, one can use regex sets.

/(?[ \p{Greek} & \p{Upper} ])/    # Greek AND Upper

Before 5.36, this requires use experimental qw( regex_sets ); to be in effect. It's safe to add that line and use the feature as far back as 5.18, where it was introduced as an experimental feature, since no change has been made to the feature since then.
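As a minimal sketch, here is the regex-set version applied to the string from the question. It assumes Perl 5.18 or later and that the experimental module is available (on older perls it may need to be installed from CPAN); the variable names are taken from the question.

```perl
use 5.018;
use warnings;
use utf8;                             # the source contains literal Greek letters
use experimental qw( regex_sets );    # needed before 5.36 (assumes the module is installed)

binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

# (?[ ... ]) is an extended bracketed character class; & is set intersection.
my @upper_greek = ($str =~ /(?[ \p{Greek} & \p{Upper} ])/g);
say "Upper+Greek: @upper_greek";
```

With the question's input, this should print the wanted line, Upper+Greek: Φ Ψ Ω.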


There are some other approaches that can be used in older versions of Perl, but they are indisputably harder to read.

One way of achieving AND in a regex is using lookarounds.

/\p{Greek}(?<=\p{Upper})/         # Greek AND Upper

Another way of getting an AND is to negate an OR. De Morgan's laws tell us

NOT( Greek AND Upper )  ⇔  NOT(Greek) OR NOT(Upper)

so

Greek AND Upper  ⇔  NOT( NOT(Greek) OR NOT(Upper) )

This gives us

/[^\P{Greek}\P{Upper}]/           # Greek AND Upper

This is more efficient than using a lookbehind.
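To check that the two older approaches agree, a small sketch along these lines can be run on 5.14. It reuses the string from the question; the array names @lookbehind and @negated are only illustrative.

```perl
use 5.014;
use warnings;
use utf8;                 # the source contains literal Greek letters

binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

# Match a Greek character, then look back at that same character
# and require that it is also uppercase.
my @lookbehind = ($str =~ /\p{Greek}(?<=\p{Upper})/g);

# Negated class: exclude everything that is non-Greek or non-Upper,
# which leaves only characters that are both Greek and Upper.
my @negated = ($str =~ /[^\P{Greek}\P{Upper}]/g);

say "Lookbehind:    @lookbehind";
say "Negated class: @negated";
```

Both lines should list Φ Ψ Ω for this input.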

Upvotes: 12

Tanktalus

Reputation: 22254

This works in 5.14.0 as well:

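# User-defined property names must begin with "In" or "Is".
# "+" adds a set; "&" intersects the result with another set.
sub InUpperGreek {
    return <<'END';
+utf8::Greek
&utf8::Upper
END
}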

my @upper_greek = ($str =~ /\p{InUpperGreek}/g);
say "Upper Greek: @upper_greek";

Not sure if that's simpler. :) For more information on how this works, see the perlunicode documentation on user-defined character properties.

Upvotes: 7
