hrichardson

Reputation: 33

How to replace spans of whitespace with a single space in a C string?

I am trying to write a function that turns all spans of whitespace (i.e., multiple spaces, a newline, a tab, or any continuous sequence of these) into a single space.

For example, the following inputs:

"example\tinput\tstring"
"example\ninput\nstring"
"example  \tinput \n string"

Would all result in the same output: "example input string"

I currently have the following code based on a similar question here on Stack Overflow (https://stackoverflow.com/a/1217750/6652030, the 2nd answer). It handles sequences of multiple spaces correctly, but it doesn't replace tabs and newlines with spaces as intended. If I pass the first two example inputs, my resulting string is "exampleinputstring". Any thoughts on where I'm going wrong?

void removeExtraWhitespace(char *src, char *dst) {
  for (; *src; ++dst, ++src) {
    *dst = *src;
    if (*src == '\n' || *src == '\t') {
      *src = ' ';
    }

    else if (isspace(*src)) {
      while (!isspace(*(src + 1))) {
        ++src;
      }
    }
  }

  *dst = '\0';
}

Upvotes: 1

Views: 207

Answers (4)

assefamaru

Reputation: 2789

You can make the following modification to your code (note that the parameter order here is dst first, then src):

void removeExtraWhitespace(char *dst, const char *src) {
  for(; *src; ++dst, ++src) {
    if (isspace(*src)) {
      *dst = ' ';
      while (isspace(*(src + 1))) {
        ++src;
      }
    } else {
      *dst = *src;
    }
  }
  *dst = '\0';
}

For example,

char dst[50];
removeExtraWhitespace(dst, "\t\t\texample\t\t\n    input\nstring\n\n   ");
printf("%s\n", dst);
 example input string 

Upvotes: 1

Jonathan Leffler

Reputation: 753585

The code I came up with for the problem was:

static
void removeExtraWhiteSpace(char *src, char *dst)
{
    while (*src != '\0')
    {
        if (!isspace((unsigned char)*src))
            *dst++ = *src++;
        else
        {
            *dst++ = ' ';
            while (isspace((unsigned char)*src))
                src++;
        }
    }
    *dst = '\0';
}

For each character, if it isn't a space according to isspace() from <ctype.h>, copy it to the output. If it is a space, copy a space to the output, and skip over any following space characters. When finished, add the null terminator.

The function is static because I make all functions static unless there's a header that declares the function for use in other files. The compilation options I use require this discipline (or a prior extern void removeExtraWhiteSpace(char *src, char *dst) declaration for the function), but static is shorter.
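To illustrate the two options described above, here is a minimal sketch. It assumes the compilation options in question are warnings such as GCC's -Wmissing-prototypes (the answer does not name them). The function names squeeze_spaces and squeeze_spaces_public are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Option 1: static gives the function internal linkage, so no prior
   declaration is needed. (Hypothetical name, same logic as above.) */
static void squeeze_spaces(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0')
    {
        if (!isspace((unsigned char)*src))
            *dst++ = *src++;
        else
        {
            *dst++ = ' ';
            while (isspace((unsigned char)*src))
                src++;
        }
    }
    *dst = '\0';
}

/* Option 2: a function with external linkage is preceded by a
   declaration, which would normally live in a header. */
extern void squeeze_spaces_public(const char *src, char *dst);

void squeeze_spaces_public(const char *src, char *dst)
{
    squeeze_spaces(src, dst);
}
```

Either form should satisfy a missing-prototype warning; static additionally keeps the name out of the program's global namespace.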

If you want to remove leading and trailing blanks, it isn't much harder:

static
void removeExtraWhiteSpace(char *src, char *dst)
{
    char *tgt = dst;
    while (isspace((unsigned char)*src))
        src++;
    while (*src != '\0')
    {
        if (!isspace((unsigned char)*src))
            *tgt++ = *src++;
        else
        {
            *tgt++ = ' ';
            while (isspace((unsigned char)*src))
                src++;
        }
    }
    *tgt = '\0';
    if (tgt > dst && tgt[-1] == ' ')
        tgt[-1] = '\0';
}

Test code:

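#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* removeExtraWhiteSpace() as defined above goes here */

static void test_string(char *buffer1)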
{
    printf("Before [%s]\n", buffer1);
    char buffer2[1024];
    removeExtraWhiteSpace(buffer1, buffer2);
    printf("After  [%s]\n", buffer2);
}

int main(void)
{
    test_string("example\tinput\tstring");
    test_string("example\ninput\nstring");
    test_string("example  \tinput \n string");
    test_string("  \t spaces\t \tand tabs\tboth  before\t\tand  \t \t after  \t\t ");

#ifdef GO_INTERACTIVE
    char buffer[1024];
    while (fgets(buffer, sizeof(buffer), stdin) != 0)
    {
        buffer[strcspn(buffer, "\n")] = '\0';
        test_string(buffer);
    }
#endif /* GO_INTERACTIVE */

    return 0;
}

Plain output:

Before [example input   string]
After  [example input string]
Before [example
input
string]
After  [example input string]
Before [example     input 
 string]
After  [example input string]
Before [     spaces     and tabs    both  before        and          after           ]
After  [ spaces and tabs both before and after ]

With tabs and newlines marked (^I for tabs, ^J for newlines):

Before [example^Iinput^Istring]^J
After  [example input string]^J
Before [example^J
input^J
string]^J
After  [example input string]^J
Before [example  ^Iinput ^J
 string]^J
After  [example input string]^J
Before [  ^I spaces^I ^Iand tabs^Iboth  before^I^Iand  ^I ^I after  ^I^I ]^J
After  [ spaces and tabs both before and after ]^J

Upvotes: 0

chux

Reputation: 153348

How to replace spans of whitespace with a single space in a C string?

Simply keep track of the previous action and test for whitespace. Only one tight loop is needed, with one call to isspace(). Leading and trailing whitespace is handled the same way: each run collapses to a single space.

#include <ctype.h>
#include <stdbool.h>

void removeExtraWhitespace(const char *src, char *dst) {
  bool previous_was_whitespace = false;
  while (*src) {
    if (isspace((unsigned char) *src)) {
      if (!previous_was_whitespace) {
        *dst++ = ' ';
      }
      previous_was_whitespace = true;
    } else {
      *dst++ = *src;
      previous_was_whitespace = false;
    }
    src++;
  }
  *dst = '\0';
}

Any thoughts on where I'm going wrong?

When OP's code encounters a '\n' or '\t', it changes src[], but that never affects dst[], because *dst = *src has already copied the original character.

Also drop the else in the code below. This allows consumption of consecutive whitespace after '\n' or '\t'. Yet this still has trouble with other whitespace characters such as '\r'.

if (*src == '\n' || *src == '\t') {
  *src = ' ';
}

// else if (isspace(*src)) {
if (isspace(*src)) {

This code uses isspace((unsigned char) *src) rather than isspace(*src) as isspace() is only defined for values in the unsigned char range and EOF. With learner programs, it is unusual to encounter negative values for *src, yet they can exist and conversion to the unsigned char range is prudent.
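As a small illustration of the cast, the sketch below wraps isspace() in a helper that accepts a plain char. The helper names is_space_char and count_whitespace are hypothetical, not part of any library. The byte 0xE9 stands in for a character whose value is negative on platforms where char is signed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: lets callers pass a plain char safely. */
static int is_space_char(char c)
{
    /* Convert to unsigned char first so a negative char value never
       reaches isspace(), which would be undefined behavior. */
    return isspace((unsigned char)c) != 0;
}

/* Hypothetical helper: counts the whitespace characters in a string. */
static size_t count_whitespace(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s)
    {
        if (is_space_char(*s))
            ++n;
    }
    return n;
}
```

For instance, count_whitespace("a\xE9" "\t b\n") visits the 0xE9 byte without ever passing a negative value to isspace().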

Upvotes: 0

YePhIcK

Reputation: 5856

In essence you are copying a string into a new place while collapsing all the whitespace (and newline) characters into a single character (if I understood you correctly).

While you are copying over you have three possible "modes of operation":

  1. No whitespace (straight copy)
  2. "Whitespace" mode - you have encountered a whitespace character and are currently skipping till you see a non-whitespace again
  3. You are done skipping whitespace and are going back to "straight copy" mode

Putting that into pseudocode, this looks like the following:

bool skipping = false;
for(each char){
  if(iswhite(char)){
    skipping = true;
    continue; // next source char
  }
  // not a whitespace, let's see if we are done skipping
  if(skipping){
    // collapse all those skipped whitespaces into one
    copy_space_into_dest;
  }
  skipping = false;
  copy_char_into_dest;
}

The iswhite() above could be a call to isspace() or your own function that returns true for anything it considers to be whitespace (for example, if you decide that a _ counts as whitespace).
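One possible C rendering of the pseudocode above is sketched here, assuming isspace() is used as the whitespace test and that dst has room for at least strlen(src) + 1 bytes. The name collapse_whitespace is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the pseudocode above (hypothetical name).
   dst must be large enough to hold the whole of src plus the terminator. */
static void collapse_whitespace(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    bool skipping = false;
    for (; *src != '\0'; ++src)
    {
        if (isspace((unsigned char)*src))
        {
            skipping = true;   /* remember the run, emit nothing yet */
            continue;          /* next source char */
        }
        if (skipping)
            *dst++ = ' ';      /* run ended: collapse it into one space */
        skipping = false;
        *dst++ = *src;
    }
    *dst = '\0';
}
```

Because the space is only written when a non-whitespace character follows the run, this approach drops trailing whitespace entirely, while a leading run still becomes one space.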

Upvotes: 0
