user246392

Reputation: 3009

Creating a Regex to remove consecutive whitespaces except for newlines

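I'd like to use a Regex to do the following:

  • All whitespace characters except for newlines must be converted to a space (i.e. \f, \r, \t, \v will be converted to a space).
  • If a space is preceded or followed by a newline, the space should be removed.
  • A string cannot have two or more consecutive whitespaces except for newlines. Newlines are limited to two consecutive occurrences at most (i.e. \n is okay, \n\n is okay too, but \n\n\n is not allowed and should be replaced by \n\n).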

Some examples:

space-space => space
space-space-space => space
space-tab => space
space-tab-space => space
newline-newline => newline-newline
space-newline => newline
space-newline-newline => newline-newline
newline-space => newline
newline-space-newline => newline-newline

The only Regex I could come up with so far is the one below, but it collapses every run of consecutive whitespace, newlines included:

Regex.Replace(input, @"(\s)\s+", "$1");

Upvotes: 1

Views: 1784

Answers (2)

Wiktor Stribiżew

Reputation: 626738

To match any whitespace except a newline, you may use the negated character class [^\S\n]. The .NET character class subtraction [\s-[\n]] also works, but I prefer the first one since it is portable to other regex engines.
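As a quick illustration of the two character classes on their own, here is a minimal sketch that collapses runs of non-newline whitespace and leaves newlines untouched. The class and method names (NonNewlineWhitespaceDemo, CollapsePortable, CollapseSubtraction) are made up for this example and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named only for this illustration.
public static class NonNewlineWhitespaceDemo
{
    // [^\S\n] reads as "not (non-whitespace or newline)",
    // i.e. any whitespace character other than \n.
    public static string CollapsePortable(string input) =>
        Regex.Replace(input, @"[^\S\n]+", " ");

    // .NET-specific character class subtraction: \s minus \n.
    public static string CollapseSubtraction(string input) =>
        Regex.Replace(input, @"[\s-[\n]]+", " ");
}
```

Both methods should give the same result for the same input; only the first pattern is expected to work unchanged in most other regex flavors.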

Now, you may use a regex that matches an optional newline to the left and to the right of 1+ whitespaces other than newline. In the replacement, check whether either newline was matched: if so, drop the whitespace and keep only the matched newline(s); otherwise, replace the match with a single space. Finally, replace any chunk of 3 or more newlines with two newlines.

var result = Regex.Replace(input, @"(\n?)[^\S\n]+(\n?)", m =>
    !string.IsNullOrEmpty(m.Groups[1].Value) || !string.IsNullOrEmpty(m.Groups[2].Value) // If any \n matched
        ? $"{m.Groups[1].Value}{m.Groups[2].Value}" // Concat Group 1 and 2 values
        : " ");  // Else, replace the 1+ whitespaces matched with a space
var final_result = Regex.Replace(result, @"\n{3,}", "\n\n"); // Replace 3+ \ns with two \ns

Details

  • (\n?) - Capturing group 1: an optional newline
  • [^\S\n]+ - 1+ whitespaces other than newline
  • (\n?) - Capturing group 2: an optional newline
  • \n{3,} - 3 or more newlines.
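To check this approach against the examples in the question, the two Regex.Replace calls can be wrapped in a single method. This is only a sketch: the names WhitespaceNormalizer and Normalize are hypothetical, and the callback tests the group lengths instead of using string.IsNullOrEmpty, which should behave the same way here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper, named only for this illustration.
public static class WhitespaceNormalizer
{
    public static string Normalize(string input)
    {
        // Step 1: handle each run of non-newline whitespace,
        // together with an optional newline on either side.
        var result = Regex.Replace(input, @"(\n?)[^\S\n]+(\n?)", m =>
            m.Groups[1].Length > 0 || m.Groups[2].Length > 0
                ? m.Groups[1].Value + m.Groups[2].Value // keep only the newline(s)
                : " ");                                 // no newline nearby: one space

        // Step 2: cap runs of newlines at two.
        return Regex.Replace(result, @"\n{3,}", "\n\n");
    }
}
```

For instance, Normalize("\n \n") should return "\n\n" and Normalize(" \t ") should return a single space, matching the newline-space-newline and space-tab-space cases from the question.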

Upvotes: 3

AdrianHHH

Reputation: 14038

A simple multi-step solution is as follows:

All whitespace characters except for newlines must be converted to a space (i.e. \f, \r, \t, \v will be converted to a space)

output = Regex.Replace(input, "[\\f\\r\\t\\v ]+", " ");

Note that the character class above also includes a literal space, so runs of plain spaces are collapsed as well.

If a space is preceded or followed by a newline, the space should be removed.

output = Regex.Replace(output, " \n", "\n");
output = Regex.Replace(output, "\n ", "\n"); 

Since neither pattern uses any regex features, the two calls above could instead use String.Replace:

output = output.Replace(" \n", "\n");
output = output.Replace("\n ", "\n");

or even to:

output = output.Replace(" \n", "\n").Replace("\n ", "\n");

A string cannot have two or more consecutive whitespaces except for newlines. Newlines are limited to two consecutive occurrences at most (i.e. \n is okay, \n\n is okay too, but \n\n\n is not allowed and should be replaced by \n\n).

output = Regex.Replace(output, "\n\n\n+", "\n\n");
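Put together, the steps above might look like the following sketch. The class and method names (MultiStepNormalizer, Normalize) are hypothetical; the patterns and replacements are the ones from the steps above, using the chained String.Replace variant for the middle step.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper, named only for this illustration.
public static class MultiStepNormalizer
{
    public static string Normalize(string input)
    {
        // Step 1: collapse runs of non-newline whitespace into one space.
        string output = Regex.Replace(input, "[\\f\\r\\t\\v ]+", " ");

        // Step 2: drop a space that sits directly before or after a newline.
        output = output.Replace(" \n", "\n").Replace("\n ", "\n");

        // Step 3: limit consecutive newlines to two.
        return Regex.Replace(output, "\n\n\n+", "\n\n");
    }
}
```

The order matters: after step 1 every run of non-newline whitespace is exactly one space, so step 2 only ever needs to remove a single space next to a newline.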

As an aside: if the system uses \r\n for newline sequences, then suppressing the \r characters may cause unwanted results.
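One possible way to deal with this, sketched here under the assumption that the input uses either \n or \r\n consistently, is to convert \r\n to \n first, run the steps, and then restore \r\n at the end. The names CrLfAwareNormalizer and Normalize are hypothetical, and this is only one of several reasonable policies for line endings.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper, named only for this illustration.
public static class CrLfAwareNormalizer
{
    public static string Normalize(string input)
    {
        // Assumption: if any \r\n is present, the whole input uses \r\n.
        bool usesCrLf = input.Contains("\r\n");

        // Work on \n-only text so the \r of a line ending is never treated as whitespace.
        string output = input.Replace("\r\n", "\n");

        output = Regex.Replace(output, "[\\f\\r\\t\\v ]+", " ");
        output = output.Replace(" \n", "\n").Replace("\n ", "\n");
        output = Regex.Replace(output, "\n\n\n+", "\n\n");

        // Restore the original line-ending style if needed.
        return usesCrLf ? output.Replace("\n", "\r\n") : output;
    }
}
```

A lone \r that is not part of a \r\n pair is still converted to a space by the first Regex.Replace, as in the original steps.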

Upvotes: 0
