Reputation: 11
Old question: How SubString,Limit Using C? Nobody answered that question.
I want to get the character at a given index of a string.
My string may contain symbols and UTF-8 characters (e.g. ß).
String-handling speed is important to me.
Is the wchar_t data type a good choice for me?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>
int main()
{
    wchar_t *msg1 = L"ßC Programming";
    //wprintf(L" vals> %Ls\n",msg1);
    //wprintf(L" vals> %s\n",msg1);
    printf(" vals> %Ls %S\n", msg1, msg1); // shows nothing =====> BUG
    printf(" val> %Lc\n", msg1[1]);        // shows `C`
    printf(" val> %Lc\n", msg1[0]);        // shows nothing =====> BUG
    printf("\n");
    /////////////////////////////////
    char *msg2 = "ßC Programming";
    printf(" vals> %s\n", msg2);   // shows `ßC Programming`
    printf(" val> %c\n", msg2[1]); // shows `�` =====> BUG
    printf(" val> %c\n", msg2[0]); // shows `�` =====> BUG
    printf("\n");
}
Please help me solve these problems.
Upvotes: 1
Views: 2000
Reputation: 25518
wchar_t can be an option. You should be aware of the encoding it uses, though. If it is 16 bits wide, UTF-16 is used (common, but not guaranteed) and you are using code points equal to or higher than 0x10000 (U+10000), you have the same problem again: such a code point takes two 16-bit units (a surrogate pair), so one array element is no longer one character.
I personally would rather stay with normal char, though.
The question now is how to detect multibyte characters. You can spot these by looking at the most significant bit (MSB) of each byte: if it is not set, you have a normal (ASCII-compatible) character; if it is set, the byte is part of a multibyte character.
If the second MSB is set, too, it is the start byte of a multibyte sequence; if it is not set, it is a follow-up byte.
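The bit tests described above could be sketched roughly like this. All function names are hypothetical helpers made up for illustration, and the counting function assumes the input is already valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of any library. */

/* MSB clear (0xxxxxxx): plain ASCII-compatible character. */
int utf8_is_ascii(unsigned char b)
{
    return (b & 0x80) == 0x00;
}

/* Two top bits set (11xxxxxx): start byte of a multibyte sequence.
   Note: this simple test also accepts 0xF8..0xFF, which never occur
   in valid UTF-8. */
int utf8_is_lead(unsigned char b)
{
    return (b & 0xC0) == 0xC0;
}

/* MSB set, second MSB clear (10xxxxxx): follow-up byte. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count characters (code points) by counting every byte that is not
   a follow-up byte. Assumes s is valid UTF-8. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!utf8_is_continuation((unsigned char)*s))
            ++count;
    }
    return count;
}
```

For the string from the question, strlen reports 15 bytes, while utf8_length reports 14 characters, because 'ß' occupies two bytes.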
The format of a UTF-8 multibyte sequence is as follows:
First byte: the n most significant bits being set to 1 specify how many bytes the entire sequence comprises, followed by a zero bit. The remaining bits are the most significant bits of your Unicode code point.
Each subsequent byte has 10 as its most significant bits; the remaining 6 bits are the next most significant bits of your code point.
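Based on that format, getting the character at a given index from a plain char string might look like the sketch below. The names utf8_sequence_length and utf8_code_point_at are hypothetical, and the UINT32_MAX error value is my own convention. The sketch checks the byte headers but does not reject overlong encodings or surrogates, so treat it as an illustration rather than a complete validator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes announced by a first byte,
   or 0 if b cannot start a sequence. */
size_t utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                         /* follow-up byte or invalid */
}

/* Hypothetical helper: decode the code point at character position
   `index` (0-based). Returns 0 if index is past the end of the string
   and UINT32_MAX if a malformed sequence is met on the way. */
uint32_t utf8_code_point_at(const char *s, size_t index)
{
    const unsigned char *p = (const unsigned char *)s;
    assert(s != NULL);
    while (*p != 0) {
        size_t len = utf8_sequence_length(*p);
        uint32_t cp;
        size_t i;
        if (len == 0)
            return UINT32_MAX;
        /* Payload bits of the first byte: strip the length header. */
        cp = (len == 1) ? *p : (uint32_t)(*p & (0xFFu >> (len + 1)));
        for (i = 1; i < len; ++i) {
            /* Each follow-up byte must look like 10xxxxxx; this also
               stops at the terminating NUL of a truncated string. */
            if ((p[i] & 0xC0) != 0x80)
                return UINT32_MAX;
            cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
        }
        if (index == 0)
            return cp;
        --index;
        p += len;
    }
    return 0;
}
```

Note that this lookup walks the string from the start, so it is linear in the index. If you need many random accesses and speed matters, decoding once into an array of 32-bit code points may be the better trade-off.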
Example, the letter 'ß': it has the Unicode code point 0xdf (U+00DF), binary 0b11011111.
It requires 8 bits, which do not fit into the seven available in a single-byte character, so we need to split it:
11 + 011111
We need two bytes in total, so we need to add the byte headers 110 and 10; the first byte must then be filled up with zeros:
110 000 11 + 10 011111
So you get the byte sequence 0b11000011, 0b10011111 (hexadecimal: 0xc3, 0x9f).
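The same splitting can be written as code. This is a minimal sketch of an encoder; utf8_encode is a hypothetical name, and the caller is assumed to supply a buffer of at least 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any library.
   Encodes code point cp as UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                     /* 7 bits: single byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* up to 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* up to 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* up to 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xdf, buf) takes the two-byte branch: 0xdf >> 6 is 0b11, giving 0xc0 | 0x03 = 0xc3, and 0xdf & 0x3f is 0b011111, giving 0x80 | 0x1f = 0x9f, which matches the bytes derived by hand.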
There are, though, libraries facilitating this. You might be interested in ICU, for instance.
Upvotes: 1