How can I print a string with the same length with or without multicharacters?

Question

I am trying to do exercise 1-22 in the K&R book. It asks to fold long lines (i.e. break them onto a new line) after a predefined number of characters.

While I was testing the program it worked well, but I saw that some lines were "folding" earlier than they should. I noticed that these were the lines containing special characters, such as:

ö ş ç ğ

So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?

Christophe · Accepted Answer

What happens in your code?

K&R was written at a time when every character was encoded in one single char. Examples of such encoding standards are ASCII and ISO 8859.

Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding represents the many thousands of Unicode characters with 8-bit bytes, using a variable-length scheme:

the ASCII characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.

So the letter ö and the others in your list are each encoded as 2 consecutive bytes. Unfortunately, the char-based functions of the standard C library and the algorithms in K&R count bytes and are not aware of variable-length encodings. Each of your special characters is therefore counted as two, and your algorithm folds the line too early.
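To make the variable-length scheme concrete, here is a minimal sketch that reads the length of a character off its first byte. The function name utf8_char_bytes is hypothetical, and the sketch assumes the input really is UTF-8; it does not validate the bytes that follow the first one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes used by the UTF-8 character that
   starts at s[0], or 0 if that byte cannot start a character.
   The terminating '\0' counts as a one-byte character.
   Only the first byte is inspected; nothing else is validated. */
size_t utf8_char_bytes(const char *s)
{
    unsigned char lead;

    assert(s != NULL);
    lead = (unsigned char)s[0];

    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII            */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence  */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence  */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence  */
    return 0;                             /* 10xxxxxx or invalid lead   */
}
```

In UTF-8 the letter ö is the two bytes 0xC3 0xB6, so utf8_char_bytes("\xC3\xB6") gives 2, and strlen() reports 2 for that one-letter string as well. That extra byte is what a byte-counting fold routine trips over.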

How to solve it?

There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.

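I can suggest a trick that uses the properties of the encoding scheme: in UTF-8, every byte of a multi-byte character except the first has the bit pattern 10xxxxxx. So whenever you count the display length of a string, just ignore the bytes c in memory that satisfy the condition (c & 0xC0) == 0x80. The parentheses are required, because in C the == operator binds more tightly than &.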
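The trick can be sketched as follows. The function name utf8_display_length is hypothetical, and the sketch assumes valid UTF-8 input in which every character occupies one column on screen (so it ignores combining marks and double-width characters).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the characters of a UTF-8 string as they
   would be displayed, by skipping continuation bytes (10xxxxxx).
   Assumes valid UTF-8 and one display column per character. */
size_t utf8_display_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Cast first: plain char may be signed on some platforms. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;   /* ASCII byte or first byte of a multi-byte character */
    }
    return n;
}
```

In a fold routine, the same test can be applied as each byte is read: increment the column counter only when the byte is not a continuation byte, and never break the line in front of a continuation byte.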

Another way would be to use wide characters, wchar_t/wint_t (declared in the header wchar.h), instead of char/int, and to use getwc()/putwc() instead of getc()/putc(). If sizeof(wchar_t) is 4 in your environment, you will be able to work with Unicode just by using the wide characters and wide library functions in place of the normal ones mentioned in K&R. If, however, sizeof(wchar_t) is smaller (for example 2), you can work correctly with a large subset of Unicode but may still miscount in some cases, since characters outside that subset typically take two wchar_t units (a surrogate pair).
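A minimal sketch of the wide-character approach, under some assumptions: the names fold_wide and init_locale are hypothetical, the program's environment selects a UTF-8 locale, and sizeof(wchar_t) is 4. The wide I/O functions convert bytes according to the current locale, so setlocale() has to be called before the first read; in the default "C" locale, non-ASCII input usually fails to convert.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale named by the environment
   (LANG, LC_ALL, ...). Returns 0 on failure. Call once, before any I/O. */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: copies `in` to `out`, starting a new line whenever
   a line already holds `width` characters. It counts wide characters,
   not bytes. Returns the number of line breaks it inserted. */
long fold_wide(FILE *in, FILE *out, int width)
{
    wint_t c;          /* wint_t, not wchar_t, so that WEOF fits */
    int col = 0;       /* characters already written on this line */
    long breaks = 0;

    assert(in != NULL && out != NULL && width > 0);
    while ((c = getwc(in)) != WEOF) {
        if (c == L'\n') {
            col = 0;                 /* the input ended the line itself */
        } else if (col == width) {
            putwc(L'\n', out);       /* line is full: fold here */
            ++breaks;
            col = 0;
        }
        putwc((wchar_t)c, out);
        if (c != L'\n')
            ++col;
    }
    return breaks;
}
```

A caller would run init_locale() first and then something like fold_wide(stdin, stdout, 80). With this approach ö, ş, ç and ğ each arrive as one wint_t, so they count as one column each.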
