ttsiodras

Reputation: 11258

A regular expression mystery

I am investigating a regexp mystery. I am tired so I may be missing something obvious - but I can't see any reason for this.

In the examples below I use perl, but I first saw this in VIM, so I am guessing it is something common to more than one regexp engine.

Assume we have this file:

$ cat data
1 =2   3 =4
5 =6  7 =8

We can then delete the whitespace in front of the '=' with...

$ cat data | perl -ne 's,(.)\s+=(.),\1=\2,g; print;'
1=2   3=4
5=6  7=8

Notice that in every line, all instances of the match are replaced; we used the /g search modifier, which doesn't stop at the first replace and instead goes on replacing till the end of the line.

For example, both the space before the '=2' and the space before the '=4' were removed, in the same line.

Why not use simpler constructs like 's, =,=,g'? Well, we were preparing for more difficult scenarios... where the right-hand sides of the assignments are quoted strings, and can be either single or double-quoted:

$ cat data2
1 ="2"   3 ='4 ='
5 ='6'  7 ="8"

To do the same work (remove the whitespace before the equal sign), we have to be careful, since the strings may contain the equal sign - so we mark the first quote we see, and look for it via back-references:

$ cat data2 | perl -ne 's,(.)\s+=(.)([^\2]*)\2,\1=\2\3\2,g; print;'
1="2"   3='4 ='
5='6'  7="8"

We used the back-reference \2 to search for anything that is not the same quote as the one we first saw, any number of times ([^\2]*). We then searched for the original quote itself (\2). If found, we used back references to refer to the matched parts in the replace target.

Now look at this:

$ cat data3 
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

What we want here is to drop the last space character that exists before every instance of '=' in every line. Like before, we can't use a simple 's, =",=",g', because the strings themselves may contain the equal sign.

So we follow the same pattern as we did above, and use back-references:

$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,g; print;"
posAndWidth="40:5 ="   height        ="1"
posAndWidth="-1:8 ='"  textAlignment ="Right"

It works... but only on the first match of the line! The space following 'textAlignment' was not removed, and neither was the one on the line above it (the 'height' one).

Basically, it seems that /g is not functional anymore: running the same replace command without /g produces exactly the same output:

$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,; print;"
posAndWidth="40:5 ="   height        ="1"
posAndWidth="-1:8 ='"  textAlignment ="Right"

It appears that in this regexp, the /g is ignored. Any ideas why?

Upvotes: 4

Views: 454

Answers (2)

cooltea

Reputation: 1113

I will elaborate on my comment to TLP's answer:

ttsiodras, you are asking two questions:

1- Why does your regex not produce the desired result? Why does the g flag not work?

The reason is that your regular expression contains [^\3], which is not handled the way you expect: inside a character class, \3 is not recognised as a back reference. I looked but could not find a way to use a back reference inside a character class.

2- How do you remove the space preceding an equal sign and leave alone the part that comes after it and is between quotes?

This would be a way to do it (see this reference):

$ cat data3 | perl -pe "s,(([\"']).*?\2)| (=),\1\3,g"
posAndWidth="40:5 ="   height       ="1"
posAndWidth="-1:8 ='"  textAlignment="Right"

The first part of the regex catches whatever is between quotes (single or double) and is replaced by itself; the second part matches the equal sign preceded by a space that you are looking for. Please note that this solution only works around the "interesting" part, the complemented character class with a back reference [^\3], by using the non-greedy quantifier *? instead.


Finally, if you want to pursue the negative lookahead solution:

$ cat data3 | perl -pe 's,(\w+)(\s*) =(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g'
posAndWidth="40:5 ="   height       ="1"
posAndWidth="-1:8 ='"  textAlignment="Right"

The part with the quotes between square brackets still means "[\"']", but I had to put single quotes around the whole perl command; otherwise bash returns an error on the negative lookahead (?!...) syntax, because an interactive bash treats ! inside double quotes as history expansion.

EDIT Corrected the regex with negative lookahead: notice the non-greedy operator *? again and the g flag.

EDIT Took ttsiodras's comment into account: removed the non-greedy operator.

EDIT Took TLP's comment into account

Upvotes: 1

TLP

Reputation: 67900

Inserting some debug characters in your substitution sheds some light on the issue:

use strict;
use warnings;

while (<DATA>) {
    s,(\w+)(\s*) =(['"])([^\3]*)\3,$1$2=$3<$4>$3,g;
    print;                       #  here -^ -^
}

__DATA__
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

Output:

posAndWidth="<40:5 ="   height        ="1>"
posAndWidth="<-1:8 ='"  textAlignment ="Right>"
#            ^--------- match ---------------^

Note that a single match runs through both quoted strings at once. It would seem that [^\3]* does not do what you think it does.

Regex is not the best tool here. Use a parser that can handle quoted strings, such as Text::ParseWords:

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

while (<DATA>) {
    chomp;
    my @a = quotewords('\s+', 1, $_);
    print Dumper \@a;
    print "@a\n";
}

__DATA__
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

Output:

$VAR1 = [
          'posAndWidth',
          '="40:5 ="',
          'height',
          '="1"'
        ];
posAndWidth ="40:5 =" height ="1"
$VAR1 = [
          'posAndWidth',
          '="-1:8 =\'"',
          'textAlignment',
          '="Right"'
        ];
posAndWidth ="-1:8 ='" textAlignment ="Right"

I included the Dumper output so you can see how the strings are split.

Upvotes: 3
