navarian

Reputation: 41

UTF-8 encoding in C with getchar()

I have to write a program that reads UTF-8 encoded characters and "translates" them into Unicode code points. UTF-8 is described at https://en.wikipedia.org/wiki/UTF-8. I am a C beginner, so three restrictions have been placed on me:

  1. I must use getchar()
  2. It is forbidden to use arrays
  3. I am only interested in characters encoded in 1, 2, 3, or 4 bytes

So I have this code, which works for 4-byte characters. (I know I must check every getchar() result against EOF, but that is not my problem for now.)

#include <stdio.h>

int main(void) {
        int ch1, ch2, ch3, ch4, c;
        ch1 = getchar();
        ch2 = getchar();
        ch3 = getchar();
        ch4 = getchar();
        if ((ch1 & 0xF8) != 0xF0 || (ch2 & 0xC0) != 0x80 ||
                        (ch3 & 0xC0) != 0x80 || (ch4 & 0xC0) != 0x80) {
                printf("Error in UTF-8 4-byte encoding\n");
                return 1;
        }
        c = ((ch1 & 0x07) << 18) | ((ch2 & 0x3F) << 12) |
                        ((ch3 & 0x3F) << 6) | (ch4 & 0x3F);
        printf("c = %05X\n", c);
        return 0;
}

My question: I cannot understand how to use getchar() for 1-, 2-, and 3-byte characters. Must I call getchar() four times at the beginning and then use only ch1 for 1-byte characters and ch1, ch2 for 2-byte characters, or must I do it as shown below? (By the way, the code below does not work; it gives me an infinite loop. I include it only as an example of my thinking.)

#include <stdio.h>

int main (void) {
        int ch1, ch2, ch3, ch4, c;

        if (c >=0x0000 && c<=0x007F ){
             ch1=getchar();
            while (ch1 !=EOF){
                if ((ch1 & 0x80) != 0x00) {
                    printf("Error in UTF-8 1-byte encoding\n");
                    return 1;   
                   }
                 c = ((ch1 & 0x80) << 7);
                 printf("c = %05X\n", c);
                }
        }

Upvotes: 2

Views: 3775

Answers (1)

Sami Kuhmonen

Reputation: 31153

You can't do it by first reading four bytes and then deciding what to do. If the first byte is in 0x00-0x7F, it is a complete character on its own, so you would either throw away the other three bytes or have to handle them in a more difficult way.

The proper way is to read one byte. The number of leading 1s in its most significant bits tells you how many extra bytes you need, if any. Then read the extra bytes and build the Unicode code point by masking off the most significant bits and shifting.
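As a minimal sketch of that first step, the function below classifies a lead byte by its top bits. The name utf8_extra_bytes and the -1 return value for an invalid lead byte are assumptions made for illustration, not part of any library. The caller is expected to have checked for EOF already.

```c
/* Hypothetical helper: given the first byte of a UTF-8 sequence,
 * return how many continuation bytes must follow it (0 to 3),
 * or -1 if the byte cannot start a sequence. */
int utf8_extra_bytes(int lead)
{
    if ((lead & 0x80) == 0x00) return 0;   /* 0xxxxxxx: 1-byte character */
    if ((lead & 0xE0) == 0xC0) return 1;   /* 110xxxxx: 2-byte character */
    if ((lead & 0xF0) == 0xE0) return 2;   /* 1110xxxx: 3-byte character */
    if ((lead & 0xF8) == 0xF0) return 3;   /* 11110xxx: 4-byte character */
    return -1;                             /* 10xxxxxx or 11111xxx: invalid lead */
}
```

Each mask keeps one bit more than the pattern has 1s, so the terminating 0 bit is checked as well.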

You can check the documentation you linked to see how the bits of the Unicode code point are distributed over several bytes. Here is also a brief explanation of the algorithm:

  • Read one byte
  • If the topmost bit is zero, there is nothing else to do: the code point is 0x00-0x7f
  • If the topmost three bits are 110, you need one extra byte. Take the five lowest bits of the first byte, shift them left by six bits, and OR in the lowest six bits of the second byte to get the final value
  • If the topmost four bits are 1110, you need two extra bytes. Take the four lowest bits of the first byte shifted left by 12 bits, OR in the six lowest bits of the second byte shifted left by six, then finally OR in the six lowest bits of the third byte
  • If the topmost five bits are 11110, you need three extra bytes: take the three lowest bits of the first byte shifted left by 18 bits, then read, mask, and shift the remaining bytes as before
  • If none of those conditions fit, the data is invalid
  • Note that when reading extra bytes, those bytes must have 10 as the most significant bits; anything else is invalid.
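The steps above can be sketched as one function that reads a single code point with getchar() and uses no arrays. This is an illustration under stated assumptions, not a definitive implementation: the name read_code_point and the UTF8_END / UTF8_INVALID return values are made up for this example, and the sketch does not reject overlong encodings, surrogates, or values above 0x10FFFF.

```c
#include <stdio.h>

/* Hypothetical return values chosen for this sketch. */
#define UTF8_END     (-1L)   /* input ended cleanly before a new character */
#define UTF8_INVALID (-2L)   /* bad lead byte, bad continuation byte, or truncated input */

/* Read one UTF-8 encoded character from stdin and return its code point. */
long read_code_point(void)
{
    int ch = getchar();
    int extra;   /* continuation bytes still to read */
    long cp;     /* code point being assembled */

    if (ch == EOF)
        return UTF8_END;

    /* The lead byte gives the length and the highest payload bits. */
    if ((ch & 0x80) == 0x00)      { cp = ch;        extra = 0; }
    else if ((ch & 0xE0) == 0xC0) { cp = ch & 0x1F; extra = 1; }
    else if ((ch & 0xF0) == 0xE0) { cp = ch & 0x0F; extra = 2; }
    else if ((ch & 0xF8) == 0xF0) { cp = ch & 0x07; extra = 3; }
    else
        return UTF8_INVALID;

    /* Every continuation byte must look like 10xxxxxx and adds six bits. */
    while (extra-- > 0) {
        ch = getchar();
        if (ch == EOF || (ch & 0xC0) != 0x80)
            return UTF8_INVALID;
        cp = (cp << 6) | (ch & 0x3F);
    }
    return cp;
}
```

A main() could call read_code_point() in a loop until it returns UTF8_END, printing each result with something like printf("c = %05lX\n", (unsigned long)cp) and stopping with an error message on UTF8_INVALID. Because each call reads only as many bytes as the lead byte asks for, nothing is read ahead and thrown away.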

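The second snippet won't even work: c is never given a value, so reading it in the if condition is undefined behavior. The loop also never calls getchar() again, so ch1 never changes, which explains the infinite loop. It doesn't check the bytes properly either, so that code won't help you much.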

Upvotes: 7
