Reputation: 11
I've been using Perl for a while, but I'm stuck on matching Unicode data in an input file. Most of the characters are currency symbols, but they aren't completely predictable. Currencies from around the world are included, and I'm trying to convert each Unicode symbol into its three-letter abbreviation.
I'm trying to match ç, which shows up in some places where I expect a Euro sign (€) when scanning the file. My pattern doesn't seem to recognize the ç.
Here's what I have so far:
use strict;
use utf8;
#binmode(STDOUT, ":utf8");
use open qw/:std :utf8/;
open (FILE_INPUT, "$source_file") || die "Unable to open source file: $source_file: $!\n";
LINE: while (my $line_input = <FILE_INPUT>)
{
    chomp $line_input;
    ....
    $input_price = '7.50 Ç';
    ## This regex rarely seems to match, no matter what I do:
    if ($input_price =~ /\s?\P{c}\s?/)
    {
        ## We have a match! Please remove this unicode:
        $input_price =~ s/(\P{c})/EUR /;
        print "Converted price field: ($input_price)\n";
    }
}
But then my output is:
EUR.50 Ç
I've also tried various forms of \x escapes and UTF-8 codes to match the character explicitly, but the regex doesn't match: https://www.compart.com/en/unicode/U+00E7
For example:
if ($input_price =~ /\x{e7}/) { ... }
if ($input_price =~ /\x{00e7}/) { ... }
if ($input_price =~ /\x{c3}/) { ... }
if ($input_price =~ /\x{00c3}/) { ... }
if ($input_price =~ /\x{a7}/) { ... }
if ($input_price =~ /\x{00a7}/) { ... }
if ($input_price =~ /\x{0063}/) { ... }
if ($input_price =~ /\x{0327}/) { ... }
And not a single match occurs. I've read through Programming Perl, http://www.regular-expressions.info/unicode.html and a ton of other resources, but I'm completely stumped.
Thanks so much!!
Upvotes: 1
Views: 272
Reputation: 6798
Please check whether the following code snippet solves your problem.
NOTE: run it as: perl script.pl inputfile.dat
use strict;
use warnings;
binmode(STDOUT, ':utf8');
s/ (Ç|ç)/ EUR/g && print while <>;
Data input file
7.50 Ç
7.50 ç
Output
7.50 EUR
7.50 EUR
Note: tested on Windows 10 with code page 437.
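The snippet above matches at the byte level, so it relies on the script and the data file being saved in the same encoding. As a hedged alternative, here is a minimal sketch that decodes the input explicitly. It assumes the data is UTF-8 (an assumption, since the question does not state the file's encoding), and it writes its own temporary sample file so that it is self-contained; the sample contents mirror the data file shown above.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 source text
use File::Temp qw(tempfile);

# Create a small sample file (stand-in for the real input file).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "7.50 Ç\n7.50 ç\n";
close $tmp_fh;

# Decode while reading, so the regex sees characters rather than raw bytes.
open my $in, '<:encoding(UTF-8)', $tmp_name
    or die "Cannot open $tmp_name: $!";
my @converted;
while (my $line = <$in>) {
    chomp $line;
    $line =~ s/ [Çç]/ EUR/g;               # same substitution as in the answer
    push @converted, $line;
}
close $in;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @converted;
```

If the real file uses a different encoding, the layer name in the open call would need to change accordingly.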
The following code snippet produces the same result:
use strict;
use warnings;
my $fname = 'utf8_regex.dat';
open my $fh, '<', $fname or die "Cannot open $fname: $!";
# Read from the opened handle rather than from <> (ARGV/STDIN)
s/ (Ç|ç)/ EUR/g && print while <$fh>;
close $fh;
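A note on why the original pattern misfires: \P{c} with an uppercase P is the negation of \p{C}, the "Other" general category (control, format, unassigned and similar characters). It therefore matches almost any visible character, and the substitution replaces the leading 7 instead of the symbol. Also, the literal in the question is an uppercase Ç (U+00C7), which none of the attempted \x escapes covers. Below is a minimal sketch of a lookup-based conversion using \p{Sc} (currency symbols). The %code_for table and the convert_price helper are hypothetical names introduced for illustration, and treating Ç and ç as a euro sign is an assumption taken from the question.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical lookup table; extend it with the symbols found in your data.
my %code_for = (
    '€' => 'EUR',
    '£' => 'GBP',
    '¥' => 'JPY',
    'Ç' => 'EUR',   # assumption: stands in for a euro sign, as in the question
    'ç' => 'EUR',   # assumption: same as above
);

# Replace each known symbol with its code; unknown symbols are left unchanged.
sub convert_price {
    my ($price) = @_;
    $price =~ s/([\p{Sc}Çç])/exists $code_for{$1} ? $code_for{$1} : $1/ge;
    return $price;
}

print convert_price('7.50 Ç'), "\n";
print convert_price('7.50 €'), "\n";

# Demonstration: \P{C} matches the first character of the string, not the symbol.
my $demo = '7.50 Ç';
printf "first match of \\P{C}: %s\n", $1 if $demo =~ /(\P{C})/;

# Demonstration: the uppercase literal is U+00C7, so \x{c7} matches it.
print "matched \\x{c7}\n" if $demo =~ /\x{c7}/;
```

Keeping the mapping in a hash means new currencies can be added without touching the regex, as long as the symbol belongs to \p{Sc} or is listed in the character class.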
Upvotes: 1