GoWorkCode

Reputation: 11

How to get a UTF-8 character from a string in C?

Old question: "How SubString,Limit Using C?", but no one answered it.

I want to get the character at a given index of a string.

My string may contain symbols and UTF-8 characters (e.g. ß).

String processing speed is important to me.

1. Is the wchar_t data type a good fit for this?

2. How can I get a single character from a UTF-8 string?

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>

int main()
{
    wchar_t *msg1 = L"ßC Programming";
    //wprintf(L" vals> %Ls\n", msg1);
    //wprintf(L" vals> %s\n", msg1);
    printf(" vals> %Ls %S\n", msg1, msg1); // shows nothing =====> BUG
    printf(" val> %Lc\n", msg1[1]);        // shows `C`
    printf(" val> %Lc\n", msg1[0]);        // shows nothing =====> BUG
    printf("\n");

    char *msg2 = "ßC Programming";
    printf(" vals> %s\n", msg2);  // shows `ßC Programming`
    printf(" val> %c\n", msg2[1]); // shows `�` =====> BUG
    printf(" val> %c\n", msg2[0]); // shows `�` =====> BUG
    printf("\n");
}

Please help me solve these problems.

Upvotes: 1

Views: 2000

Answers (1)

Aconcagua

Reputation: 25518

wchar_t can be an option, but you should be aware of the encoding it uses. If it is 16 bits wide and holds UTF-16 (common, but not guaranteed), then any code point equal to or higher than 0x10000 (U+10000) takes two wchar_t units, and you have the same problem again...

Personally, I would rather stay with plain char.

The question is then how to detect multibyte characters. You can spot them by looking at the most significant bit (MSB) of each byte: if it is not set, the byte is a normal single-byte character (ASCII compatible); if it is set, the byte is part of a multibyte character.

If the second most significant bit is set too, the byte is the start byte of a multibyte sequence; if it is not set, the byte is a follow-up (continuation) byte.
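The two bit tests above can be sketched in C roughly like this. The names utf8_byte_kind, utf8_classify and utf8_length are made up for illustration, not from any library, and the counting function assumes the input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names, chosen for this sketch only. */
enum utf8_byte_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONTINUATION };

/* Classify a single byte by its two most significant bits. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00)
        return UTF8_ASCII;        /* 0xxxxxxx: MSB not set */
    if ((b & 0xC0) == 0x80)
        return UTF8_CONTINUATION; /* 10xxxxxx: follow-up byte */
    return UTF8_LEAD;             /* 11xxxxxx: start of a sequence */
}

/* Count characters (code points) rather than bytes: every byte that
 * is not a continuation byte starts a new character.
 * Assumption: s is valid, NUL-terminated UTF-8. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (utf8_classify((unsigned char)*s) != UTF8_CONTINUATION)
            ++n;
    return n;
}
```

Note the cast to unsigned char before testing bits: plain char may be signed, in which case bytes with the MSB set would be negative.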

The format of a UTF-8 multibyte sequence is as follows:

First byte: the n most significant bits are set to 1, where n is the number of bytes the entire sequence comprises, followed by a zero bit. The remaining bits hold the most significant bits of your Unicode code point.

Each subsequent byte has 10 as its two most significant bits; its remaining 6 bits hold the next most significant bits of your code point.
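Based on this layout, getting the character at a given index from a plain char string might look like the sketch below. utf8_seq_len, utf8_at and utf8_decode are hypothetical names for this example. The code checks that continuation bytes are present, but it does not reject overlong encodings or surrogate code points, so treat it as a starting point rather than a full validator:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes in the sequence that starts with lead byte b,
 * or 0 if b cannot start a sequence (continuation or invalid byte). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1; /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}

/* Pointer to the first byte of the index-th character (0-based),
 * or NULL if the string is too short or malformed. */
const char *utf8_at(const char *s, size_t index)
{
    assert(s != NULL);
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        size_t i;
        if (len == 0)
            return NULL;
        /* Each follow-up byte must look like 10xxxxxx; this also
         * stops at the terminating NUL of a truncated sequence. */
        for (i = 1; i < len; ++i)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return NULL;
        if (index == 0)
            return s;
        --index;
        s += len;
    }
    return NULL;
}

/* Decode the code point at p. Assumption: p was returned by utf8_at,
 * so the sequence is complete. Returns U+FFFD for a bad lead byte. */
uint32_t utf8_decode(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    size_t len = utf8_seq_len(u[0]);
    uint32_t cp;
    size_t i;
    if (len == 0)
        return 0xFFFD;
    if (len == 1)
        return u[0];
    cp = u[0] & (0x7F >> len);          /* payload bits of first byte */
    for (i = 1; i < len; ++i)
        cp = (cp << 6) | (u[i] & 0x3F); /* 6 payload bits per byte */
    return cp;
}
```

To print the character that p = utf8_at(msg, 0) points at, something like printf("%.*s", (int)utf8_seq_len((unsigned char)*p), p) should work on a UTF-8 terminal. Indexing this way is linear in the string length; if you need many random accesses, it may be worth decoding once into an array of uint32_t.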

Example: the letter 'ß' has Unicode code point 0xdf, binary 0b11011111.

It requires 8 bits, which do not fit into the seven available in a single-byte character, so we need to split it:

11 + 011111

We need two bytes in total, so we add the byte headers 110 and 10; the first byte must then be padded with zeros:

110 00011 + 10 011111

So you get the byte sequence 0b11000011, 0b10011111 (hexadecimal: 0xc3, 0x9f).
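The same splitting can be written as a small encoder. This is a minimal sketch with a made-up name (utf8_encode), assuming the caller provides a buffer of at least 4 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the UTF-8 encoding of code point cp into out and return the
 * number of bytes written, or 0 if cp is not encodable (a surrogate
 * or a value above U+10FFFF). Hypothetical helper for illustration. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* fits in 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* up to 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* up to 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* up to 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xDF, buf) takes the two-byte branch: 0xC0 | (0xDF >> 6) gives 0xC3 and 0x80 | (0xDF & 0x3F) gives 0x9F, matching the sequence worked out by hand.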

There are libraries that handle all of this for you, though. You might be interested in ICU, for instance.

Upvotes: 1
