Andrew

Reputation: 43

Regular expression to search and replace en dashes with a normal dash in Perl?

I currently need a regular expression that searches for every – (en dash) and replaces it with -. I am already replacing ` with ', and that works using:

while($_ =~ s/`/'/g)
{
  print "Line: '$.'. ";
  print "Found '$&'. ";
}

However, the same approach does not work for any of my attempts below:

while($_ =~ s/\–/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/\&#8211/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/\&ndash/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/\–/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/&#8211/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/&ndash/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

The script currently looks as follows:

#!/usr/bin/perl
use strict;
use warnings;
my $FILE;
my $filename = 'NoDodge.c';

open($FILE,"<service.c") or die "File not opened";
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
while (<$FILE>)
{
  while($_ =~ s/`/'/g)
  {
    print "Line: '$.'. ";
    print "Found '$&'. ";
  }
  while($_ =~ s/\&#8211/-/g)
  {
    print "Line: '$.'. ";
    print "Found '$&'.\n";
  }
  print $fh $_;
}
close $fh;
print "\nCompleted\n";

Example of current result:

Line: '152'. Found '`'.

Line: '162'. Found '`'.

Completed

SOLUTION (provided by Borodin):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw/ :std :encoding(utf8) /;

my $FILE;
my $fh;
my $readfile = 'service.c';
my $writefile = 'NoDodge.c';

open($FILE,'<',$readfile) or die qq{Unable to open "$readfile" for input: $!};
open($fh, '>',$writefile) or die qq{Unable to open "$writefile" for output: $!};
while (<$FILE>)
{
  while(s/–/-/g)
  {
    print "Found: $& on Line: $.\n";
  }

  while(s/`/'/g)
  {
    print "Found: $& on Line: $.\n";
  }

  print $fh $_;
}
close $fh;
close $FILE;
print "\nService Migrated to $writefile\n";

Example Output:

Found: – on Line: 713

Found: ` on Line: 713

Found: – on Line: 724

Found: ` on Line: 724

Found: ` on Line: 794

Service Migrated to NoDodge.c

Upvotes: 2

Views: 312

Answers (1)

Borodin

Reputation: 126742

You need to add use utf8 at the top of your program; otherwise Perl will see the en-dash in your source code as the three individual bytes that make up its UTF-8 encoding (E2 80 93) rather than as a single character. There's also no need to specify $_ as the object of the substitution, as it is the default, and you needn't escape an en-dash, as it's not a special character within regex patterns:

use utf8;

...

while( s/–/-/g ) { ... }
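To illustrate the byte-versus-character point above, here is a minimal sketch. It assumes the core Encode module and uses made-up variable names; the en-dash is written with hex escapes so the example does not depend on how the source file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw UTF-8 bytes of "a", en-dash, "b" (illustrative sample data).
my $bytes = "a\xE2\x80\x93b";

# Undecoded, the en-dash occupies three separate one-byte "characters".
my $byte_length = length($bytes);
my $in_bytes    = ($bytes =~ /\x{2013}/) ? 1 : 0;

# After decoding, the same data holds one character, U+2013.
my $chars       = decode('UTF-8', $bytes);
my $char_length = length($chars);
my $in_chars    = ($chars =~ /\x{2013}/) ? 1 : 0;

print "length as bytes: $byte_length, as characters: $char_length\n";
print "pattern matches in bytes: $in_bytes, in characters: $in_chars\n";
```

The pattern for the single character U+2013 only matches once the data has been decoded, which is the same reason the literal – in the question's regex never matched.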

Or you may want to make it clearer by using Unicode names, as it's far from obvious at a glance what it is you're replacing. In that case you don't need use utf8, as long as you name every non-ASCII character instead of using it literally, like this:

while( s/\N{EN DASH}/-/g ) { ... }
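As a small self-contained sketch of the named-character form, the following counts the replacements instead of looping. The sample string and variable names are assumptions for illustration; the explicit use charnames line is only needed on Perl versions before 5.16 and is harmless afterwards.

```perl
use strict;
use warnings;
use charnames ':full';   # required for \N{...} before Perl 5.16

# Illustrative sample line containing two en-dashes.
my $line = "pages 10\N{EN DASH}12 and 20\N{EN DASH}25";

# With /g, s/// replaces every match in one pass and returns how many it made.
my $count = ($line =~ s/\N{EN DASH}/-/g);

print "replaced $count dash(es): $line\n";
```

Because s///g handles every match in a single pass, a while loop around it runs its body at most once per line, and $& then holds only the last match. Capturing the return value, as here, is one way to report how many dashes were actually replaced.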



You will also need to open the files, both input and output, as UTF-8-encoded. The simplest way is to set UTF-8 as the default mode by adding this line near the top of your program:

use open qw/ :std :encoding(utf8) /;

Or you can open each file explicitly as UTF-8-encoded, like this:

my $filename = 'NoDodge.c';

open my $in_fh, '<:encoding(utf8)', 'service.c'
        or die qq{Unable to open "service.c" for input: $!};

open my $out_fh, '>:encoding(utf8)', $filename
        or die qq{Unable to open "$filename" for output: $!};
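The explicit-layer approach can be tried end to end with a sketch like the one below. It is not the original script: the scratch file names are hypothetical stand-ins for service.c and NoDodge.c, File::Temp is used so nothing real is overwritten, and the layer is spelled :encoding(UTF-8), the strict variant, rather than the lax :encoding(utf8) shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch files standing in for the real input and output.
my $dir      = tempdir(CLEANUP => 1);
my $in_name  = "$dir/in.txt";
my $out_name = "$dir/out.txt";

# Seed the input with one line containing an en-dash (U+2013).
open my $seed_fh, '>:encoding(UTF-8)', $in_name
    or die qq{Unable to open "$in_name" for output: $!};
print {$seed_fh} "x = a \x{2013} b;\n";
close $seed_fh or die qq{Unable to close "$in_name": $!};

# Read decoded characters, substitute, and write encoded output.
open my $in_fh, '<:encoding(UTF-8)', $in_name
    or die qq{Unable to open "$in_name" for input: $!};
open my $out_fh, '>:encoding(UTF-8)', $out_name
    or die qq{Unable to open "$out_name" for output: $!};

my $total = 0;
while (my $line = <$in_fh>) {
    # s///g returns an empty string when nothing matched, hence the "|| 0".
    $total += ($line =~ s/\x{2013}/-/g) || 0;
    print {$out_fh} $line;
}
close $in_fh;
close $out_fh or die qq{Unable to close "$out_name": $!};

# Read the result back to confirm the dash was replaced.
open my $check_fh, '<:encoding(UTF-8)', $out_name
    or die qq{Unable to open "$out_name" for input: $!};
my $result = <$check_fh>;
close $check_fh;

print "replacements: $total\n";
print "result: $result";
```

Using a lexical $line instead of $_ is a stylistic choice here; the substitution behaves the same either way.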

Upvotes: 4
