Sobrique

Reputation: 53478

Regex - zero width 'word boundary' makes an alternation pattern match correctly

With reference to: perl string catenation and substitution in a single line?

Given an input of:

home/////test/tmp/

And a desired transform to:

/home/test/tmp/

(and other file-path-like patterns that need leading and trailing slashes but no doubled ones. E.g. /home/test/tmp/ passes through unchanged, but /home/test/tmp gets a trailing slash, etc.)

Using three separate substitutions:

s,^/*,/,;  #prefix
s,/*$,/,; #suffix
s,/+,/,g; #double slashes anywhere else. 

Gives us the right result:

#!/usr/bin/env perl

use strict;
use warnings;

my $str = 'home/////test/tmp/';
$str =~ s,^/*,/,;    #prefix
$str =~ s,/*$,/,;    #suffix
$str =~ s,/+,/,g;    #double slashes anywhere else.
print $str; 

But if I try to combine these patterns into one using alternation:

s,(^/*|/+|/*$),/,g 

This looks like it should work, but it doesn't: I get a double trailing slash (/home/test/tmp//).

But after adding a zero-width word boundary (\b) to the last alternative, it works fine:

s,(^/*|/+|\b/*$),/,g;

Can anyone help me understand what's happening differently in the alternation group, and is there a possible gotcha with just leaving that \b in there?

Upvotes: 3

Views: 136

Answers (2)

zdim

Reputation: 66883

The reason is that, under /g, the /+ alternative matches the last slash, and the search then continues from the position after that match, which is the end of the string. There the /*$ alternative can still match: zero slashes followed by the end anchor. That empty match is replaced as well, which adds the second /.

We can see this with

perl -wE'
    $_ = "home/dir///end/"; 
    while (m{( ^/* | /+ | /*$ )}gx) { say "Got |$1| at ", pos }
'

which prints (with the "at ..." parts aligned for readability)

Got ||    at 0
Got |/|   at 5
Got |///| at 11
Got |/|   at 15
Got ||    at 15

With the actual substitution

s{( ^/* | /+ | /*$ )}{ say "Got |$1| at ", pos; q(/) }egx

the numbers differ as they refer to positions in the intermediate strings, where the last two

...
Got |/| at 14
Got ||  at 15

are telling.

I don't see what can go wrong with having \b, as in the question or as /*\b$.


This is an interesting question, but I'd like to add that all these details are avoided by

$_ = '/' . (join '/', grep { /./ } split '/', $_) . '/'  for @paths;

Upvotes: 2

anubhava

Reputation: 785108

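Here is a single regex that does it all:

s='home/////test/tmp/'
perl -lpe 's~^(?!/)|(?<!/)$|/{2,}~/~g' <<< "$s"
/home/test/tmp/

s='home/test/tmp'
perl -lpe 's~^(?!/)|(?<!/)$|/{2,}~/~g' <<< "$s"
/home/test/tmp/

(The -l switch chomps the newline that the here-string appends, so that $ can only match at the true end of the path.)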

Regex Breakup:

^(?!/) # Line start if not followed by /
|
(?<!/)$ # Line end if not preceded by /
|
/{2,} # 2 or more /

Upvotes: 0
