Reputation: 532
I currently have a string, say $line='55.25040882, 3,,,,,,', from which I want to remove all whitespace and repeated commas and periods. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
This works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes every run of repeated periods or commas and all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled on. Thanks so much!
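For anyone who wants to check the claim about \r and \n, here is a minimal sketch. The sample string (with a tab and a Windows-style line ending) is made up for illustration and is not from the original data.

```perl
use strict;
use warnings;

# Hypothetical sample: a space, a tab, a run of commas, and a \r\n line ending.
my $line = "55.25040882, 3,\t4,,,,,,\r\n";

# ([.,])\1{1,} matches a period or comma followed by one or more copies of itself;
# the other two alternatives match a single space or a single tab.
( my $cleaned = $line ) =~ s/([.,])\1{1,}| |\t//g;

# \r and \n survive because the pattern never mentions them.
print $cleaned eq "55.25040882,3,4\r\n" ? "line ending kept\n" : "unexpected result\n";
```

Note that a repeated run is removed entirely rather than collapsed to a single character, which is why nothing is left of the trailing commas.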
Upvotes: 3
Views: 408
Reputation: 213411
You can try using:
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # [^\S\n\r] matches any whitespace except \n and \r
print $line;
OUTPUT:
55.25040882,3
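If the double negative in [^\S\n\r] is hard to read, one way to see what it matches is to try it on a few single characters. This is only a sketch; the labels in the list are made up for the example.

```perl
use strict;
use warnings;

# For each sample character, record whether [^\S\n\r] matches it.
my %matches;
for my $pair ( [ space => " " ], [ tab => "\t" ], [ linefeed => "\n" ],
               [ carriage_return => "\r" ], [ letter => "a" ] ) {
    my ( $name, $char ) = @$pair;
    $matches{$name} = $char =~ /[^\S\n\r]/ ? 1 : 0;
}

# Space and tab match; \n, \r, and the letter do not.
printf "%-16s %s\n", $_, $matches{$_} ? "matched" : "not matched"
    for sort keys %matches;
```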
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}.
[.,]{2,} -> This removes any run of two or more consecutive characters, each of which is a , or a .
[^\S\n\r] -> This matches any whitespace character except linefeed (\n) and carriage return (\r).
Upvotes: 2
Reputation: 75272
This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
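Both points can be checked with a short script. This is just a sketch, and the sample characters and test string are made up for illustration.

```perl
use strict;
use warnings;

# The caret is not the first character in the class, so nothing is negated:
# "(", "^", and ")" are ordinary members, alongside everything \s matches.
my $class = qr/[(^\n^\r)\s]/;
my @hits  = grep { /$class/ } ( "(", "^", ")", " ", "\n", "a" );
print scalar(@hits), " of 6 sample characters matched\n";    # "a" is the only miss

# Outside a class the leading ^ is an anchor, and the \s alternative already
# covers \n and \r everywhere, so both substitutions strip the same characters.
my $text = "\na b\r\nc";
( my $with_anchor = $text ) =~ s/^[\n\r]|\s//g;
( my $plain       = $text ) =~ s/\s//g;
print $with_anchor eq $plain ? "same result\n" : "different result\n";
```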
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
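Here is a rough sketch of those rules as a set of checks. The matches helper and the hash keys are hypothetical names introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string matches the pattern, else 0.
sub matches { my ( $str, $re ) = @_; return $str =~ $re ? 1 : 0 }

my %result = (
    caret_first_negates       => matches( "c", qr/[^ab]/ )   && !matches( "a", qr/[^ab]/ ),
    caret_later_is_literal    => matches( "^", qr/[a^b]/ ),
    dot_is_literal_in_class   => matches( ".", qr/[.+*]/ )   && !matches( "x", qr/[.+*]/ ),
    hyphen_makes_a_range      => matches( "b", qr/[a-c]/ )   && !matches( "-", qr/[a-c]/ ),
    escaped_hyphen_is_literal => matches( "-", qr/[a\-c]/ )  && !matches( "b", qr/[a\-c]/ ),
    caret_outside_is_anchor   => matches( "ab", qr/^a/ )     && !matches( "ba", qr/^a/ ),
);

# Every check should report "holds".
print "$_: ", ( $result{$_} ? "holds" : "fails" ), "\n" for sort keys %result;
```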
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not mixed pairs such as ., and ,. (a period next to a comma). Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
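To make the difference concrete, here is a small sketch comparing the character-class version with the backreference version. The input string is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample: one mixed pair (".,") and two true repeats (",," and "..").
my $input = "1.,2,,3..4";

# [.,]{2,} treats any two or more periods/commas in a row as a match.
( my $any_mix = $input ) =~ s/[.,]{2,}//g;

# ([.,])\1+ requires the same character to repeat, so "., " style pairs survive.
( my $same_char = $input ) =~ s/([.,])\1+//g;

print "class version:         $any_mix\n";      # 1234
print "backreference version: $same_char\n";    # 1.,234
```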
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That requires Perl 5.10 or later. PCRE and Java 8+ also support \h, but in Ruby/Oniguruma \h matches a hex digit, and many other flavors don't recognize it at all.)
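A quick sketch of how \h differs from \s, assuming Perl 5.10 or later. The input string is made up for the example.

```perl
use 5.010;    # \h is available from Perl 5.10 onward
use strict;
use warnings;

# Hypothetical sample: spaces and a tab inside the line, then a \r\n ending.
my $input = "a \t b\r\n";

( my $horizontal = $input ) =~ s/\h+//g;    # strips spaces and tabs only
( my $all_space  = $input ) =~ s/\s+//g;    # strips the line ending too

print "after \\h+: ", length($horizontal), " characters left\n";    # 4: a, b, \r, \n
print "after \\s+: ", length($all_space),  " characters left\n";    # 2: a, b
```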
Upvotes: 3