Reputation: 1109
By default, the end-of-line anchor is supposed to occupy the imaginary position between the last character and the line feed. Why does '\s*$' consume the line feed in the following example?
perl -pe 's/(\.\d{4})\d+\s*$/\1/'
The objective of the above substitution is to truncate a number with five or more digits after the decimal point to exactly four.
e.g.: 123.54321 -> 123.5432
I don't want to waste time transforming ".5+digits non-digits" (e.g.: 5.12345 blah) because it will just fail pre-load validation anyway.
/home/mlibby> echo -e '38492.38945\n5.12345 blah\n624.54321 \n9.325437' | perl -pe 's/(\.\d{4})\d+$/\1/'
38492.3894
5.12345 blah
624.54321
9.3254
However, I do want to transform ".5+digits whitespace" (e.g.: 624.54321 ) because trailing whitespace is valid, but should be trimmed.
So after consuming the fifth and any later digits, I add \s*$ to consume zero or more whitespace characters up to the end-of-subject anchor.
/home/mlibby> echo -e '38492.38945\n5.12345 blah\n624.54321 \n9.325437' | perl -pe 's/(\.\d{4})\d+\s*$/\1/'
38492.38945.12345 blah
624.54329.3254/home/mlibby>
So, why is the above search pattern consuming the line feed, causing the substitution to remove LF and ultimately truncate lines?
Granted, I can change my substitution to \1\n, but the point of this post is to understand the behavior. By default, $ should anchor just before the line feed. What's going on here?
FYI: Perl version 5.8.8 on RHEL 5.8
Upvotes: 0
Views: 496
Reputation: 240601
$ matches in either of two places: at the end of the string, or immediately before a newline at the end of the string.
A newline is a kind of whitespace, so \s matches it. Your \s* therefore consumes all trailing whitespace, including the newline, and since $ also matches at the end of the string when there is no newline left, no backtracking is forced.
You could use a non-greedy match, \s*?, to match as little whitespace as possible, which guarantees that it won't eat the newline that $ is prepared to ignore.
Or you could match any whitespace that isn't a newline, namely [^\S\n]. If that seems weird, think about De Morgan's law: NOT ((NOT whitespace) OR newline) == whitespace AND (NOT newline).
Upvotes: 3
Reputation: 425318
\s matches newlines, and $ matches at the end of input (after the very last character) as well as just before a final newline.
Change your regex to match only non-newline whitespace (e.g. spaces and tabs):
perl -pe 's/(\.\d{4})\d+[ \t]*$/\1/'
Upvotes: 0