Kris Van Bruwaene

Reputation: 47

perl regex unexpected behaviour of /m modifier

I want to remove leading and trailing spaces from a multi-line string with this regular expression:

s/^\s*|\s*$//mg

It seems to work more or less fine in this example:

perl -e '$_=" a \n \n b\n"; s/^\s*|\s*$//mg; print "$_\n";'

which gives the result:

a
b

(What I did not expect is that the two \n with a space between them have become a single \n.)

But watch this:

perl -e '$_=" a \n\n b\n"; s/^\s*|\s*$//mg; print "$_\n";'

result:

ab

Now both \n's have disappeared and the multi-line string has become a single line, which is not what I want. If this is not a bug, how can I avoid this behaviour?

Upvotes: 3

Views: 195

Answers (2)

ikegami

Reputation: 386396

\s can match line feeds, which is what leads to the line feeds being removed.

Replace \s with one of the following:

  • \h
    Matches only horizontal whitespace characters. It doesn't match line feeds, but it doesn't match the other vertical whitespace characters either.[1]
  • (?[ \s - \n ])
    Before 5.36, this requires use experimental qw( regex_sets );. It is safe to add that line and use the feature as far back as 5.18, where it was introduced as an experimental feature, since the feature has not changed since then.
  • [^\S\n]
    Matches a character that's neither a non-whitespace character nor a line feed, which is to say a whitespace character that's not a line feed.
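To make the replacements above concrete, here is a minimal sketch that applies two of them to both strings from the question. The helper names trim_lines and trim_lines_h are made up for this illustration; \h needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: trim each line using [^\S\n],
# i.e. any whitespace character except a line feed.
sub trim_lines {
    my ($text) = @_;
    $text =~ s/^[^\S\n]*|[^\S\n]*$//mg;
    return $text;
}

# Hypothetical helper: same idea with \h (horizontal whitespace only).
sub trim_lines_h {
    my ($text) = @_;
    $text =~ s/^\h*|\h*$//mg;
    return $text;
}

# Both strings from the question keep all of their line feeds.
for my $text (" a \n \n b\n", " a \n\n b\n") {
    print trim_lines($text);
    print trim_lines_h($text);
}
```

Because neither character class can consume a line feed, the empty or blank middle line survives as an empty line instead of being swallowed.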

What follows details exactly how your patterns are matching.


For

␠ a ␠ ␊ ␠ ␊ ␠ b ␊
0 1 2 3 4 5 6 7 8 9

the pattern

/^\s*|\s*$/m

yields the following matches:

  1. Pos 0, len 1: ␠ is matched by ^\s*.
  2. Pos 2, len 3: ␠␊␠ is matched by \s*$. XXX
  3. Pos 5, len 0: Empty string is matched by \s*$.
  4. Pos 6, len 1: ␠ is matched by ^\s*.
  5. Pos 8, len 1: ␊ is matched by \s*$. XXX
  6. Pos 9, len 0: Empty string is matched by \s*$ (with /m, ^ does not match after a line feed that is the last character of the string).

For

␠ a ␠ ␊ ␊ ␠ b ␊
0 1 2 3 4 5 6 7 8

the pattern

/^\s*|\s*$/m

yields the following matches:

  1. Pos 0, len 1: ␠ is matched by ^\s*.
  2. Pos 2, len 2: ␠␊ is matched by \s*$. XXX
  3. Pos 4, len 2: ␊␠ is matched by ^\s*. XXX
  4. Pos 7, len 1: ␊ is matched by \s*$. XXX
  5. Pos 8, len 0: Empty string is matched by \s*$ (with /m, ^ does not match after a line feed that is the last character of the string).
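The positions and lengths in the two lists above can be reproduced with a short script. This is only a sketch: list_matches is a made-up helper name, and it reads the match offsets from Perl's built-in @- and @+ arrays.

```perl
use strict;
use warnings;

# Hypothetical helper: return [position, length] for every match of the
# question's pattern, in the order the regex engine finds them.
sub list_matches {
    my ($text) = @_;
    my @found;
    while ($text =~ /^\s*|\s*$/mg) {
        # $-[0] is the start offset of the match, $+[0] its end offset.
        push @found, [ $-[0], $+[0] - $-[0] ];
    }
    return @found;
}

for my $text (" a \n \n b\n", " a \n\n b\n") {
    printf "Pos %d, len %d\n", @$_ for list_matches($text);
    print "\n";
}
```

The script reports only where each match starts and how long it is; which alternative produced it still has to be read off the pattern or from the regex debugger.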

Footnotes:

  1. Vertical whitespace:

    • U+000A LINE FEED
    • U+000B LINE TABULATION
    • U+000C FORM FEED
    • U+000D CARRIAGE RETURN
    • U+0085 NEXT LINE
    • U+2028 LINE SEPARATOR
    • U+2029 PARAGRAPH SEPARATOR

Upvotes: 0

TLP

Reputation: 67910

Using -Mre=debug and diving into the nitty-gritty, I have found what I think is the answer. I removed the leading space because it was irrelevant to the problem, and I removed everything but the relevant parts of the output. In both cases the regex first matches the space/newline in front of the second newline using the RHS (5:BRANCH), which leaves the pointer in front of that second newline:

Case 1: String a \n \n b\n

Matching REx "^\s+|\s+$" against "%n b%n"
   4 <a %n > <%n b%n>        |   0| 1:BRANCH(5)
   4 <a %n > <%n b%n>        |   1|  2:MBOL(3)
                             |   1|  failed...
   4 <a %n > <%n b%n>        |   0| 5:BRANCH(9)
   4 <a %n > <%n b%n>        |   1|  6:PLUS(8)
                             |   1|  POSIXD[\s] can match 2 times out of 2147483647...
   6 <a %n %n > <b%n>        |   2|   8:MEOL(9)
                             |   2|   failed...
   5 <a %n %n> < b%n>        |   2|   8:MEOL(9)
                             |   2|   failed...
                             |   1|  failed...
                             |   0| BRANCH failed...
   5 <a %n %n> < b%n>        |   0| 1:BRANCH(5)  <-- HERE!
   5 <a %n %n> < b%n>        |   1|  2:MBOL(3)
   5 <a %n %n> < b%n>        |   1|  3:PLUS(9)
                             |   1|  POSIXD[\s] can match 1 times out of 2147483647...
   6 <a %n %n > <b%n>        |   2|   9:END(0)
Match successful!

In this case, the LHS (1:BRANCH) fails at first, and then the RHS (5:BRANCH) fails too, so the engine moves forward one step, to just after the newline. There the LHS matches and removes what is in front of it: a space.

It matches between the newline and the space in front of b, once the "pointer" in the regex has moved forward past the newline.

%n> < b%n>
^   \s

Case 2: String a \n\n b\n

Matching REx "^\s+|\s+$" against "%n b%n"
   3 <a %n> <%n b%n>         |   0| 1:BRANCH(5) <-- HERE!
   3 <a %n> <%n b%n>         |   1|  2:MBOL(3)
   3 <a %n> <%n b%n>         |   1|  3:PLUS(9)
                             |   1|  POSIXD[\s] can match 2 times out of 2147483647...
   5 <a %n%n > <b%n>         |   2|   9:END(0)
Match successful!

In this string, the zero-width assertion ^ in the LHS (1:BRANCH) can see the newline to its left in the string, which allows it to match. In the other string there was a space there, so it could not match. So the LHS alternative (1:BRANCH) matches and removes what follows it, namely the newline and the space.

Instead of skipping the first try and moving forward 1 step like Case 1, it can match directly on the newline to the left, and whitespace \n to the right:

%n> <%n b%n>
^   \s\s

TL;DR: In your second string, the regex can match the beginning of a line between your two newlines, and therefore both of them get removed. In the first string it cannot match there because a space is in the way; instead it moves forward one step, skipping over the newline and using that newline to match the beginning of a line. The effect is that this newline is kept in the string.

How can you avoid this behaviour? Well, the problem is that your regex is too loose. \n can satisfy every component of your regex (^, $ and \s) in various combinations. It can also match in the middle of a string. If you want to be safe and get a predictable result, apply the regex line by line and do not slurp the file into a single string. Then you do not need multi-line matching, and all your problems go away.
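A minimal sketch of the line-by-line approach, under a few assumptions: trim_line is a made-up helper name, and the input is read from an in-memory filehandle only to keep the example self-contained (a real script would loop over STDIN or a file instead).

```perl
use strict;
use warnings;

# Hypothetical helper: trim a single line. Without /m, ^ and $ refer
# only to the start and end of this one line.
sub trim_line {
    my ($line) = @_;
    chomp $line;                  # set the line ending aside first
    $line =~ s/^\s+|\s+$//g;      # strip leading and trailing whitespace
    return $line;
}

# Stand-in input: the second string from the question.
my $data = " a \n\n b\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
while (my $line = <$fh>) {
    print trim_line($line), "\n"; # put exactly one line feed back
}
close $fh;
```

Since each line feed is removed by chomp and written back by print, the regex never gets a chance to consume one, so the number of lines stays the same.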

Otherwise, avoid using the multi-line modifier, and just delete leading and trailing whitespace as normal, and then trim inside the string for multiple newlines with spaces, something like s/\n\s*\n/\n/g.

In essence, you are trying to do too many things at the same time. Make your regex stricter, and try to do things one at a time.

Upvotes: 2
