zzxyz

Reputation: 2981

line anchor behavior with perl regex

I recently wrote a little Perl script to trim whitespace from the end of lines and ran into unexpected behavior. I guessed that Perl must keep the line-ending characters when it splits the input into lines, so I tested that theory and got even more unexpected behavior. The line "I do not" should match either \s+$ or t$, not both. I'm very confused. Can anyone enlighten me?

£ cat example
I have space after me
I do not
£ perl -ne 'print if /\s+$/' example
I have a space after me
I do not
£ perl -ne 'print if /t$/' example
I do not
£

A PCRE tester gives the expected results. I've also tried the /m modifier, with no change in behavior.

Edit, for completeness:

£ perl -ne 'print if /e$/' example
£

Expected behavior from perl -ne 'print if...' was the same as grep -P:

£ grep -P '\s+$' example
I have a space after me
£

I can reproduce this under Ubuntu 16.04 with perl v5.22.1 (both the 60 and 68 patch versions) and under MINGW with perl v5.26.1.

Upvotes: 1

Views: 283

Answers (2)

Eugen Konkov

Reputation: 25263

You see this behavior because the second line of the example file ends with a \n character. \n is whitespace, so \s matches it.


perlretut

no modifiers: Default behavior. ... '$' matches only at the end or before a newline at the end.

In your regex, \s matches a whitespace character, the set [\ \t\v\r\n\f]. In other words, it matches spaces and also the \n character. $ then matches at the end of the line: it is a position, not a character. In the same way, the word anchor \b matches a word boundary, and ^ matches the beginning of the line rather than its first character.

You could rewrite your regex like this:

/[\t ]+$/

This is what cat would show if the second line of example did not end with a \n character:

£ cat example
I have space after me
I do not£

Notice that the shell prompt £ does not start on a new line.


The results differ from grep because grep strips the line ending before matching, much like Perl's -l flag does. (grep -P '\n' returns no results on a text file, whereas grep -Pz '\n' does.)

Upvotes: 5

wp78de

Reputation: 18980

Your problems stem from the -n option and the use of \s. The -n flag reads the input line by line into $_, newline included, and runs the print if /.../ statement for each line.

In your match you use the $ anchor to match the end of the line. The anchor is purely positional and does not consume the newline or any other character.

You can check this yourself with \s+ in a regex tester: whether you add a $ or not, the regex matches the same number of characters.
This is because \s is equal to [\r\n\t\f\v ] and matches any whitespace character, and you have added the + quantifier. So it matches one or more times, as many times as possible (greedy).

If you search only for trailing space characters instead, you are fine: [ ]+$ (here the space is written inside a character class so that it is visible):

£ perl -ne 'print if /[ ]+$/' example

That way the pattern does not match the \n, as \s does.

Bonus:

Here are some common Perl one-liners to trim spaces:

# Strip leading whitespace (spaces, tabs) from the beginning of each line
perl -ple 's/^[ \t]+//'
# Variant: \s also strips other leading whitespace characters, not only spaces and tabs
perl -ple 's/^\s+//'

# Strip trailing whitespace (spaces, tabs) from the end of each line
perl -ple 's/[ \t]+$//'

# Strip whitespace from the beginning and end of each line
perl -ple 's/^[ \t]+|[ \t]+$//g'

Upvotes: 2
