Reputation: 1109

Regex Whitespace to Anchor (\s*$) Consumes Line Feed

By default, the end-of-line anchor is supposed to occupy the imaginary position between the last character and the line feed. Why does '\s*$' consume the line feed in the following example?

perl -pe 's/(\.\d{4})\d+\s*$/\1/'

The objective of the above substitution is to truncate digit.5+digits to digit.4digits
e.g.: 123.54321 -> 123.5432

I don't want to waste time transforming ".5+digits non-digits" (e.g.: 5.12345 blah) because it will just fail pre-load validation anyway.

/home/mlibby> echo -e '38492.38945\n5.12345 blah\n624.54321  \n9.325437' | perl -pe 's/(\.\d{4})\d+$/\1/'
38492.3894
5.12345 blah
624.54321
9.3254

However, I do want to transform ".5+digits whitespace" (e.g.: 624.54321 ) because trailing whitespace is valid, but should be trimmed. So, after consuming the fifth and any later digits, I add \s*$ to consume zero or more whitespace characters up to the end-of-line anchor.

/home/mlibby> echo -e '38492.38945\n5.12345 blah\n624.54321  \n9.325437' | perl -pe 's/(\.\d{4})\d+\s*$/\1/'
38492.38945.12345 blah
624.54329.3254/home/mlibby>

So, why is the above search pattern consuming the line feed, causing the substitution to remove LF and ultimately truncate lines?

Granted, I can change my substitution to \1\n, but the point of this post is to understand what's going on here. By default, $ should anchor west of the line feed. What's going on here?

FYI: Perl version 5.8.8 on RHEL 5.8

Upvotes: 0

Answers (2)

hobbs

Reputation: 240601

$ matches in either of two places: at the end of the string, or immediately before a newline at the end of the string.

A newline is a kind of whitespace, so \s matches it. So your \s* consumes any trailing whitespace including the newline, and since $ matches at end of string even if there's not a newline, no backtrack is forced.

You could use a non-greedy match \s*? to match as little whitespace as possible, thus guaranteeing that it won't eat up the newline that $ is prepared to ignore.

Or you could match any whitespace that isn't a newline, namely [^\S\n] (if that seems weird, think about De Morgan's law: NOT ((NOT whitespace) OR newline) == whitespace AND (NOT newline)).

Upvotes: 3

Bohemian

Reputation: 425318

\s matches newlines, and $ matches at the end of input (after the very last character) as well as just before a trailing newline, so \s* is free to swallow the newline.

Change your regex to match only non-newline whitespace (e.g. spaces and tabs):

perl -pe 's/(\.\d{4})\d+[ \t]*$/\1/'

Upvotes: 0
