nevinleiby

Reputation: 11

Unable to match unicode character in Perl in regex (\x{00E7})

I've been using Perl for a while, but I'm stuck on matching Unicode data in an input file. Most of the characters are currency symbols, but the data doesn't seem to be completely predictable. Currencies from around the world are included, and I'm trying to convert each Unicode symbol into its three-letter abbreviation.

I'm trying to match ç, which appears in some instances when I'm scanning a file for a Euro sign (€). The pattern doesn't seem to recognize the ç.

Here's what I have so far:

use strict;
use utf8;
#binmode(STDOUT, ":utf8");
use open qw/:std :utf8/;

open (FILE_INPUT, "$source_file") || die "Unable to open source file: $source_file: $!\n";
LINE: while (my $line_input = <FILE_INPUT>)
{
   chomp $line_input;
   ....
   $input_price = '7.50 Ç';

   ## This regex rarely seems to match, no matter what I do:
   if ($input_price =~ /\s?\P{c}\s?/)
   {
        ## We have a match! Please remove this unicode:
        $input_price =~ s/(\P{c})/EUR /;
        print "Converted price field: ($input_price)\n";
   }
}

But then my output is:

EUR.50 Ç

I've also tried various forms of \x escapes and UTF-8 codes to match the character explicitly, but the regex doesn't match: https://www.compart.com/en/unicode/U+00E7

For example:

if ($input_price =~ /\x{e7}/)   { ... }
if ($input_price =~ /\x{00e7}/) { ... }
if ($input_price =~ /\x{c3}/)   { ... }
if ($input_price =~ /\x{00c3}/) { ... }
if ($input_price =~ /\x{a7}/)   { ... }
if ($input_price =~ /\x{00a7}/) { ... }
if ($input_price =~ /\x{0063}/) { ... }
if ($input_price =~ /\x{0327}/) { ... }

And not a single match occurs. I've read through Programming Perl, http://www.regular-expressions.info/unicode.html and a ton of other resources, but I'm completely stumped.

Thanks so much!!

Upvotes: 1

Views: 272

Answers (1)

Polar Bear

Reputation: 6798

Please check whether the following code snippet fits your problem.

NOTE: run it as: perl script.pl inputfile.dat

use strict;
use warnings;

binmode(STDOUT, ':utf8');

s/ (Ç|ç)/ EUR/g && print while <>;

Data input file

7.50 Ç
7.50 ç

Output

7.50 EUR
7.50 EUR

Note: tested on Windows 10 with code page 437
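The code page may matter here. As a minimal sketch, assuming the input file is really Windows-1252 rather than UTF-8 (an assumption, the question does not say), the single byte 0x80 is € in Windows-1252 but Ç in code page 437. That would explain a Ç showing up where a Euro sign is expected. The core Encode module can show both readings of the same byte:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the currency symbol is stored in the file as the single byte 0x80.
my $byte = "\x80";

my $as_cp437  = decode('cp437',  $byte);   # how a code page 437 console reads it
my $as_cp1252 = decode('cp1252', $byte);   # how Windows-1252 reads it

printf "cp437:  U+%04X\n", ord $as_cp437;    # U+00C7, capital C with cedilla
printf "cp1252: U+%04X\n", ord $as_cp1252;   # U+20AC, Euro sign
```

If that is the situation, decoding the file with its real encoding when opening it is likely a more robust fix than matching the mis-displayed letter.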

The following snippet, which opens the input file explicitly, produces the same result:

use strict;
use warnings;

my $fname = 'utf8_regex.dat';

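open my $fh, '<', $fname or die "Cannot open $fname: $!";

# Read from the opened handle; print only the lines where a substitution happened
s/ (Ç|ç)/ EUR/g && print while <$fh>;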

close $fh;
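For comparison with the patterns tried in the question, here is a hedged sketch that matches by code point on a decoded string. The sample string in the question holds the capital Ç (U+00C7), while \x{e7} is the lowercase ç (U+00E7). \x{c3} and \x{a7} are the UTF-8 bytes of ç, and \x{0063} followed by \x{0327} is its decomposed form, so none of those patterns can match a decoded Ç. Likewise, \P{c} means "any character not in the Other category", so it matches the leading 7. The helper name to_eur is made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # string literals in this source file are UTF-8 encoded

# Hypothetical helper: replace a Ç, ç or € (and any whitespace before it)
# with the currency code EUR.
sub to_eur {
    my ($price) = @_;
    $price =~ s/\s*[\x{C7}\x{E7}\x{20AC}]/ EUR/g;
    return $price;
}

binmode(STDOUT, ':encoding(UTF-8)');
print to_eur('7.50 Ç'), "\n";   # 7.50 EUR
print to_eur('7.50 ç'), "\n";   # 7.50 EUR
print to_eur('7.50 €'), "\n";   # 7.50 EUR

# \p{Sc} (Currency Symbol) matches € but not Ç or ç, which are letters.
print "€ is a currency symbol\n" if '€' =~ /\p{Sc}/;
print "Ç is a letter\n"          if 'Ç' =~ /\p{L}/ && 'Ç' !~ /\p{Sc}/;
```

When the data comes from a file, opening it with a layer such as '<:encoding(UTF-8)' (or whatever the file's real encoding is) gives decoded strings that patterns like these can work on.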

Upvotes: 1
