Yetimwork Beyene

Reputation: 249

What's the correct syntax to filter the MIDDLE DOT Unicode character using a Perl regex?

I'm trying to figure out the correct syntax to filter the MIDDLE DOT Unicode character (U+00B7) out of a string while keeping the original string:

     $_ =~ s/test_of_character (.*[^\x{00b7}])/$1/gi;

With the code above, I'm not sure how to keep the original string intact while removing the middle dot from it.

Upvotes: 0

Views: 2527

Answers (3)

Borodin

Reputation: 126722

To remove all Unicode MIDDLE DOT characters from the string, you can write

s/\N{MIDDLE DOT}//g

or

tr/\N{MIDDLE DOT}//d

I'm not clear what you mean by "keep the original string", but if you want to leave $_ unchanged and remove MIDDLE DOT characters from a copy of it, then you can write

(my $modified = $_) =~ s/\N{MIDDLE DOT}//g

or

my $modified = s/\N{MIDDLE DOT}//gr
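To see the variants above side by side, here is a minimal sketch. The sample string and the variable names ($copy, $with_r, $with_tr) are made up for illustration, and the `use charnames ':full'` line is only needed on Perls older than 5.16.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{MIDDLE DOT} before Perl 5.16

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample input, not taken from the question
$_ = "a\N{MIDDLE DOT}b\N{MIDDLE DOT}c";

# Copy first, then edit the copy (works on older Perls too)
(my $copy = $_) =~ s/\N{MIDDLE DOT}//g;

# Non-destructive substitution with /r (Perl 5.14 or later)
my $with_r = s/\N{MIDDLE DOT}//gr;

# tr///d applied to a copy
(my $with_tr = $_) =~ tr/\N{MIDDLE DOT}//d;

print "original = $_\n";          # still contains both middle dots
print "copy     = $copy\n";       # abc
print "with_r   = $with_r\n";     # abc
print "with_tr  = $with_tr\n";    # abc
```

In all three cases $_ itself is left untouched; only the copies lose their middle dots.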

Upvotes: 5

Jonathan Leffler

Reputation: 754060

If you're using Perl and Unicode, you should read the manuals such as:

The first of those shows that you can write a Unicode code point such as U+00B7 using the notation:

\N{U+00B7}

You can also use the Unicode character name:

\N{MIDDLE DOT}

The rest is basic regex handling. If you need to keep the original string, then you can use the /r modifier for the regex if your Perl is modern enough (it was added in Perl 5.14.0). Alternatively (for older versions of Perl), you can copy the string and edit the copy, as with $altans below.

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;

binmode(STDOUT, ":utf8");

my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";

print "string = $string\n";

my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
my $altans;

($altans = $string) =~ s/\N{U+00B7}//g;

# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;

print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";

Output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it

Note that the 'big middle dot' is U+2022, BULLET.
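If you are unsure which dot-like character a string actually contains, one way to check is to print each character's code point and Unicode name. This is only a sketch: the sample string is invented and written with \x{...} escapes so the script does not depend on the file's encoding, and @report is just an illustrative variable name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # gives access to charnames::viacode without importing names

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: middle dot, bullet, middle dot
my $dots = "\x{B7}\x{2022}\x{B7}";

my @report;
for my $char (split //, $dots) {
    my $code = ord $char;
    # viacode maps a code point to its official Unicode name
    push @report, sprintf "U+%04X %s", $code, charnames::viacode($code);
}

print "$_\n" for @report;
# U+00B7 MIDDLE DOT
# U+2022 BULLET
# U+00B7 MIDDLE DOT
```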


ikegami points out in a comment:

Note that \x{00B7} and \xB7 would match the same character as \N{U+00B7}.

And indeed, that is the case, as this extension of the code above shows:

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'unicode_strings';
use utf8;

binmode(STDOUT, ":utf8");

my $string = "This is some text with a ·•· middle dot or four \N{U+00B7}\N{MIDDLE DOT} in it";

print "string = $string\n";

my $answer = ($string =~ s/\N{MIDDLE DOT}//gr);
my $altans;

($altans = $string) =~ s/\N{U+00B7}//g;

# Fix grammar!
$answer =~ s/\ba\b/no/;
$answer =~ s/ or four //;

print "string = $string\n";
print "answer = $answer\n";
print "altans = $altans\n";

my $extan1 = $string;
$extan1 =~ s/\xB7//g;
print "extan1 = $extan1\n";

my $extan2 = $string;
$extan2 =~ s/\x{00B7}//g;
$extan2 =~ s/\x{0065}//g;
$extan2 =~ s/\x{2022}//g;
print "extan2 = $extan2\n";

With the output:

string = This is some text with a ·•· middle dot or four ·· in it
string = This is some text with a ·•· middle dot or four ·· in it
answer = This is some text with no • middle dot in it
altans = This is some text with a • middle dot or four  in it
extan1 = This is some text with a • middle dot or four  in it
extan2 = This is som txt with a  middl dot or four  in it

This is Perl: TMTOWTDI — There's More Than One Way To Do It!

Upvotes: 3

user557597

Reputation:

This is a general answer using your own regex, slightly modified:

$_ =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;

The inverse (preferred) equivalent, which matches only the middle dots instead of capturing everything around them, is

$_ =~ s/\x{00b7}+//g;
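A quick sketch comparing the two substitutions on the same input. The sample string and the names $input, $captured and $direct are assumptions for illustration. The possessive quantifier `*+` requires Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample with a single and a doubled middle dot
my $input = "one\x{B7}two\x{B7}\x{B7}three";

# Capture the text before each run of middle dots and put it back
(my $captured = $input) =~ s/([^\x{00b7}]*+)\x{00b7}+/$1/g;

# Match only the runs of middle dots and delete them
(my $direct = $input) =~ s/\x{00b7}+//g;

print "input    = $input\n";
print "captured = $captured\n";    # onetwothree
print "direct   = $direct\n";      # onetwothree
print "same result\n" if $captured eq $direct;
```

Both forms give the same string here; the second does less work because it never has to capture and reinsert the surrounding text.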

Upvotes: 0
