UTF-8 encoding in C with getchar()

Question

I have to write code that takes UTF-8 encoded characters and "translates" them into Unicode code points. You can read about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8. I am a C beginner, so I have three restrictions placed on me:

I must use getchar()
It is forbidden to use arrays
I am only interested in Unicode characters encoded in 1, 2, 3 or 4 bytes

So I have this code, which works for 4-byte sequences. (I know I must check != EOF for every getchar(), but that is not my problem for now.)

#include <stdio.h>

int main(void) {
        int ch1, ch2, ch3, ch4, c;
        ch1 = getchar();
        ch2 = getchar();
        ch3 = getchar();
        ch4 = getchar();
        if ((ch1 & 0xF8) != 0xF0 || (ch2 & 0xC0) != 0x80 ||
                        (ch3 & 0xC0) != 0x80 || (ch4 & 0xC0) != 0x80) {
                printf("Error in UTF-8 4-byte encoding\n");
                return 1;
        }
        c = ((ch1 & 0x07) << 18) | ((ch2 & 0x3F) << 12) |
                        ((ch3 & 0x3F) << 6) | (ch4 & 0x3F);
        printf("c = %05X\n", c);
        return 0;
}

My question: I cannot understand how to use getchar() for 1-, 2- and 3-byte sequences. Do I have to call getchar() four times at the beginning and then use only ch1 for 1-byte characters and ch1, ch2 for 2-byte characters, or should I do it like this? (By the way, the code below does not work; it gives me an infinite loop. I only include it as an example of my thinking.)

#include <stdio.h>

int main (void) {
        int ch1, ch2, ch3, ch4, c;

        if (c >=0x0000 && c<=0x007F ){
             ch1=getchar();
            while (ch1 !=EOF){
                if ((ch1 & 0x80) != 0x00) {
                    printf("Error in UTF-8 1-byte encoding\n");
                    return 1;   
                   }
                 c = ((ch1 & 0x80) << 7);
                 printf("c = %05X\n", c);
                }
        }

Sami Kuhmonen · Accepted Answer

You can't do it by first reading four characters and then deciding what to do. If the character is in 0x00-0x7f, you'll be throwing the rest out, or you have to handle them in a more difficult way.

The proper way is to read one character first. Its most significant bits tell you how many extra characters you need, if any: the number of leading 1 bits gives the total length of the sequence. Then read the extra ones and build the Unicode code point by masking off the marker bits of each byte and shifting the remaining bits into place.
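
As a rough illustration of that first step, here is a minimal sketch that classifies the first byte by counting its leading 1 bits. The names leading_ones and utf8_extra_bytes are made up for this example, and the sketch does not reject overlong lead bytes such as 0xC0.

```c
/* Hypothetical helper: count the leading 1 bits of a byte value (0..255). */
int leading_ones(int byte)
{
    int n = 0;

    /* Test bit 7, then bit 6, and so on, until a 0 bit is found. */
    while (n < 8 && (byte & (0x80 >> n)) != 0)
        n++;
    return n;
}

/*
 * Hypothetical helper: number of extra (continuation) bytes announced by
 * the first byte of a sequence, or -1 if the byte cannot start a sequence.
 */
int utf8_extra_bytes(int lead)
{
    int n = leading_ones(lead);

    if (n == 0)
        return 0;       /* 0xxxxxxx: plain one-byte character */
    if (n >= 2 && n <= 4)
        return n - 1;   /* 110xxxxx, 1110xxxx, 11110xxx */
    return -1;          /* 10xxxxxx or five or more leading 1 bits */
}
```

For example, the first byte 0xE2 has three leading 1 bits, so two more bytes have to be read with getchar().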

The article you linked shows how the bits of a Unicode code point are distributed across the bytes. Here is a brief explanation of the algorithm:

Read one byte.
If the topmost bit is zero, there is nothing else to do: the code point is 0x00-0x7f.
If the topmost three bits are 110, you need one extra byte. Take the five lowest bits of the first byte, shift them left by six bits, and OR in the lowest six bits of the second byte to get the final value.
If the topmost four bits are 1110, you need two extra bytes. Take the four lowest bits of the first byte and shift them left by 12 bits, OR in the six lowest bits of the second byte shifted left by six, then OR in the six lowest bits of the third byte.
If the topmost five bits are 11110, you need three extra bytes. Take the three lowest bits of the first byte, shift them left by 18 bits, and OR in the six lowest bits of each following byte shifted left by 12, 6 and 0 bits.
If none of those conditions fit, the data is invalid.

Note that every extra byte must have 10 as its two most significant bits; anything else is invalid.
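
The steps above could be sketched as follows, using only getchar() and no arrays. This is one possible arrangement, not the only one: the names utf8_append and utf8_read and the return values -1 (end of input) and -2 (invalid data) are assumptions made for this example, and the sketch does not reject overlong encodings, surrogates, or values above 0x10FFFF.

```c
#include <stdio.h>

/* Hypothetical helper: shift cp left by six bits and OR in the low six
 * bits of one continuation byte. */
long utf8_append(long cp, int cont)
{
    return (cp << 6) | (cont & 0x3F);
}

/*
 * Hypothetical reader: decode one UTF-8 sequence from stdin.
 * Returns the code point, -1 at end of input, or -2 for invalid data
 * (the return codes are an assumption of this sketch).
 */
long utf8_read(void)
{
    int ch = getchar();
    int extra;
    long cp;

    if (ch == EOF)
        return -1;

    if ((ch & 0x80) == 0x00) {          /* 0xxxxxxx: one byte */
        return ch;
    } else if ((ch & 0xE0) == 0xC0) {   /* 110xxxxx: one extra byte */
        extra = 1;
        cp = ch & 0x1F;
    } else if ((ch & 0xF0) == 0xE0) {   /* 1110xxxx: two extra bytes */
        extra = 2;
        cp = ch & 0x0F;
    } else if ((ch & 0xF8) == 0xF0) {   /* 11110xxx: three extra bytes */
        extra = 3;
        cp = ch & 0x07;
    } else {
        return -2;                      /* not a valid first byte */
    }

    /* Read only as many continuation bytes as the first byte announced. */
    while (extra-- > 0) {
        ch = getchar();
        if (ch == EOF || (ch & 0xC0) != 0x80)
            return -2;                  /* truncated or not 10xxxxxx */
        cp = utf8_append(cp, ch);
    }
    return cp;
}
```

A main function could call utf8_read() in a loop until it returns -1 and print each result with printf("U+%04lX\n", cp). Because the first byte decides how many further getchar() calls are made, nothing is read ahead and no bytes have to be thrown away.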

Your second snippet won't work at all: c is never given a value, so the if condition tests an uninitialized variable and its result is undefined. It doesn't check the bytes properly either, so that code won't help you much.
