Reputation: 41
I have to write a program that takes UTF-8 encoded characters and "translates" them into Unicode code points. You can read about UTF-8 at https://en.wikipedia.org/wiki/UTF-8. I am a C beginner, so I have three restrictions placed on me:
getchar()
So I have this code, which is fully functional for 4 bytes (I know I must check != EOF for every getchar(), but for now this is not my problem):
#include <stdio.h>

int main(void) {
    int ch1, ch2, ch3, ch4, c;

    ch1 = getchar();
    ch2 = getchar();
    ch3 = getchar();
    ch4 = getchar();
    if ((ch1 & 0xF8) != 0xF0 || (ch2 & 0xC0) != 0x80 ||
        (ch3 & 0xC0) != 0x80 || (ch4 & 0xC0) != 0x80) {
        printf("Error in UTF-8 4-byte encoding\n");
        return 1;
    }
    c = ((ch1 & 0x07) << 18) | ((ch2 & 0x3F) << 12) |
        ((ch3 & 0x3F) << 6) | (ch4 & 0x3F);
    printf("c = %05X\n", c);
    return 0;
}
My question: I cannot understand how to use getchar() for 1-, 2-, and 3-byte characters. Must I make all the getchar() calls at the beginning and then use ch1 for 1-byte characters and ch1, ch2 for 2-byte characters, or must I do it like this? (By the way, the code below is not functional; it gives me an infinite loop. I just use it as an example of my thinking.)
#include <stdio.h>

int main (void) {
    int ch1, ch2, ch3, ch4, c;

    if (c >=0x0000 && c<=0x007F ){
        ch1=getchar();
        while (ch1 !=EOF){
            if ((ch1 & 0x80) != 0x00) {
                printf("Error in UTF-8 1-byte encoding\n");
                return 1;
            }
            c = ((ch1 & 0x80) << 7);
            printf("c = %05X\n", c);
        }
    }
Upvotes: 2
Views: 3775
Reputation: 31153
You can't do it by first reading four characters and then deciding what to do. If the first byte is in 0x00-0x7F, it is a complete character on its own, so you would either be throwing the other three bytes away or have to handle them in a more difficult way.
The proper way is to read one character. It will tell you how many extra characters you need, if any, based on how many of its most significant bits are 1s. Then read the extra ones and convert them to a proper Unicode code point by shifting and masking off the most significant bits where needed.
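As a rough sketch of that first step, the check on the lead byte could look like the function below. The name utf8_extra_bytes is made up for illustration; it is not a standard library function. It only classifies the first byte and validates nothing that follows it.

```c
/* Hypothetical helper (name is an assumption, not a library function).
   Returns how many extra bytes follow the given lead byte (0 to 3),
   or -1 if the byte cannot start a UTF-8 sequence. */
int utf8_extra_bytes(int lead)
{
    if ((lead & 0x80) == 0x00) return 0;  /* 0xxxxxxx: single byte      */
    if ((lead & 0xE0) == 0xC0) return 1;  /* 110xxxxx: one extra byte   */
    if ((lead & 0xF0) == 0xE0) return 2;  /* 1110xxxx: two extra bytes  */
    if ((lead & 0xF8) == 0xF0) return 3;  /* 11110xxx: three extra bytes */
    return -1;                            /* 10xxxxxx or 11111xxx       */
}
```

Since getchar() returns an int, the caller would compare the value against EOF before passing it to a function like this.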
You can check the documentation you linked to in order to see how the bits of the Unicode code point are distributed across several bytes. Here is also a brief explanation of the algorithm:
- If the first byte starts with the bits 110, then you need one extra byte. Take the five lowest bits of the first byte, shift them left by six bits, and OR in the lowest six bits of the second byte to get the final value.
- If it starts with 1110, then you need two extra bytes. Take the four lowest bits of the first byte, shift them left by 12 bits, OR in the six lowest bits of the second byte shifted left by six, and finally OR in the six lowest bits of the third byte.
- If it starts with 11110, then you need three extra bytes; read them, then shift and OR as before.
- Every extra byte must have 10 as its most significant bits; anything else is invalid.

The second code sample won't even work, since c is never given a value, so the if condition reads an uninitialized variable and its result is undefined. It doesn't check the bytes properly either, so that code won't help you much.
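Putting the steps together, a minimal sketch could look like this. The function name read_code_point and its return codes (-1 for end of input, -2 for a malformed sequence) are my own choices for illustration. It takes a FILE * and uses getc(in) only so that it can be tried on any stream; with in set to stdin, getc(in) is equivalent to getchar(). It does not reject overlong encodings, surrogates, or values above 0x10FFFF.

```c
#include <stdio.h>

/* Hypothetical helper: reads one UTF-8 sequence from `in`.
   Returns the code point, -1 at end of input, or -2 on a malformed
   sequence. The name and return codes are assumptions for this sketch. */
long read_code_point(FILE *in)
{
    int lead = getc(in);   /* same as getchar() when in == stdin */
    int extra;             /* number of extra bytes still to read */
    long cp;               /* code point built up so far */

    if (lead == EOF)
        return -1;

    /* The lead byte decides the length; keep only its payload bits. */
    if      ((lead & 0x80) == 0x00) { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else
        return -2;         /* 10xxxxxx or 11111xxx cannot start a sequence */

    /* Each extra byte must look like 10xxxxxx and adds six bits. */
    while (extra-- > 0) {
        int next = getc(in);
        if (next == EOF || (next & 0xC0) != 0x80)
            return -2;
        cp = (cp << 6) | (next & 0x3F);
    }
    return cp;
}
```

Called in a loop, for example `while ((cp = read_code_point(stdin)) >= 0) printf("c = %05lX\n", (unsigned long)cp);` with cp declared as long, this would print one line per character in the same format as the code in the question.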
Upvotes: 7