Why does my program replace a character with 2 spaces?

Question

My program's task is to remove any character whose ASCII value is <32 or >127, replacing it with a single space, but my output shows 2 spaces instead. Example:

input: préféré

expected output : pr f r

my output : pr(2spaces)f(2spaces)r(2spaces)

#include <stdio.h>
#include <string.h>
int main() {
  unsigned char str[100];
  unsigned char space = ' ';
  fgets(str,100,stdin);
  int i=0;
  int length = strlen(str);
  while(i<length) {
    if( ((int)str[i]>32) && ((int)str[i]<127) )
    {
      i++;
      continue;
    }
    else
    {
        str[i]=space;
    }
    i++;
  }
  printf("%s\n",str);
}

rici · Accepted Answer

This seemingly simple problem gets quite complicated if you want to solve it in a portable, locale-aware fashion. On the other hand, if the original text is known to be encoded in UTF-8, the solution is quite simple, particularly if you don't need to detect invalid UTF-8 sequences.

The possible values of bytes in UTF-8 encodings fall into four groups:

single-byte US-ASCII characters: byte values 0x00 through 0x7F, inclusive.
first byte in a multibyte character: values 0xC2 through 0xF4, inclusive.
trailing bytes in multibyte characters: values 0x80 through 0xBF, inclusive.
bytes which cannot appear in any UTF-8 code: everything else (0xC0, 0xC1 and 0xF5 and greater).
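
As a rough illustration of the four groups, here is a minimal sketch of a byte classifier. The enum and function names are hypothetical, introduced only for this example, and it assumes the byte comes from text that is meant to be UTF-8:

```c
/* Hypothetical names, used only for illustration. */
enum utf8_byte_class { UTF8_ASCII, UTF8_LEAD, UTF8_TRAIL, UTF8_INVALID };

/* Classify one byte according to the four groups listed above. */
enum utf8_byte_class classify_utf8_byte(unsigned char b) {
  if (b <= 0x7F) return UTF8_ASCII;              /* single-byte US-ASCII */
  if (b <= 0xBF) return UTF8_TRAIL;              /* 0x80..0xBF: trailing byte */
  if (b >= 0xC2 && b <= 0xF4) return UTF8_LEAD;  /* first byte of a multibyte character */
  return UTF8_INVALID;                           /* 0xC0, 0xC1, 0xF5 and above */
}
```

For example, é is encoded in UTF-8 as the two bytes 0xC3 0xA9: the first is classified as a lead byte and the second as a trailing byte. A loop that replaces every non-ASCII byte with a space therefore prints two spaces for it.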

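Every character therefore contains exactly one byte in the first two sets of values. So a simple strategy is to produce one output character for each byte in the first two sets (the byte itself if it is printable ASCII, a space otherwise) and to just delete bytes in the last two sets: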

unsigned char* out = str;
for (unsigned char* scan = str; *scan; ++scan) {
  if (*scan >= 0x20 && *scan < 0x7F) {
    // Pass through printable ascii characters
    *out++ = *scan;
  }
  else if (*scan < 0x80 || (*scan >= 0xC2 && *scan <= 0xF4)) {
    // Replace non-printable ascii characters and lead UTF-8 bytes with space
    *out++ = ' ';
  }
  // Anything else is ignored and will be overwritten.
}
*out = 0;
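
To try the fragment on its own, it can be wrapped in a function. The following is only a sketch: the name sanitize_utf8 and the returned length are assumptions added for this example, and the buffer is assumed to hold NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical wrapper around the loop above. Rewrites str in place and
   returns the new length. The write pointer never passes the read pointer,
   so working in place is safe. */
size_t sanitize_utf8(unsigned char *str) {
  unsigned char *out = str;
  for (const unsigned char *scan = str; *scan; ++scan) {
    if (*scan >= 0x20 && *scan < 0x7F) {
      /* Printable ASCII: copy unchanged */
      *out++ = *scan;
    }
    else if (*scan < 0x80 || (*scan >= 0xC2 && *scan <= 0xF4)) {
      /* Non-printable ASCII or a UTF-8 lead byte: one space per character */
      *out++ = ' ';
    }
    /* Trailing and invalid bytes are dropped */
  }
  *out = 0;
  return (size_t)(out - str);
}
```

Called on the UTF-8 bytes of préféré, this leaves "pr f r " in the buffer (7 bytes, including a trailing space in place of the final é). A line read with fgets also ends in a newline, which is non-printable and so becomes a space as well.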

I deleted the supposedly standards-compliant portable code from this answer because it is simply too complicated, and the resulting code is unlikely to be applicable. In general, input to a utility is not guaranteed to conform to the current locale's multibyte encoding: for example, it is at least conceivable that the input is a sequence of wchar_t values (for example, a file encoded in UTF-32 on a system with 32-bit wchar_t). Or that the input is indeed in UTF-8, but the current locale is ISO-8859-7, which is a single-byte encoding. There is no general portable way to convert a wchar_t (or a multibyte sequence) to "ASCII" in order to test whether a given character is one of the ASCII printable characters in code range 0x20 through 0x7E. (And if this paragraph appears to be unintelligible jargon, that will help explain why it was difficult to write and document a portable solution.)
