Reputation: 31161
How do you map a single UTF-8 character to its Unicode code point in C?
[For example, È would be mapped to 00c8].
Upvotes: 4
Views: 1015
Reputation: 11821
A reasonably fast implementation of a UTF-8 to UCS-2 converter. Surrogates and characters outside the BMP are left as an exercise.
The function returns the number of bytes consumed from the input string s, or 0 at the end of the string. A negative value represents an error: -1 for a malformed sequence, and -2 for an overlong encoding when DETECT_OVERLONG is defined.
The resulting Unicode character is stored at the address p points to.
int utf8_to_wchar(wchar_t *p, const char *s)
{
    const unsigned char *us = (const unsigned char *)s;

    p[0] = 0;
    if (!*us)
        return 0;                       /* end of string */

    if (us[0] < 0x80) {                 /* 1 byte: 0xxxxxxx (ASCII) */
        p[0] = us[0];
        return 1;
    }

    /* 2 bytes: 110xxxxx 10xxxxxx */
    if ((us[0] & 0xE0) == 0xC0 && (us[1] & 0xC0) == 0x80) {
        p[0] = ((us[0] & 0x1F) << 6) | (us[1] & 0x3F);
#ifdef DETECT_OVERLONG
        if (p[0] < 0x80) return -2;     /* would have fit in 1 byte */
#endif
        return 2;
    }

    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    if ((us[0] & 0xF0) == 0xE0 && (us[1] & 0xC0) == 0x80 && (us[2] & 0xC0) == 0x80) {
        p[0] = ((us[0] & 0x0F) << 12) | ((us[1] & 0x3F) << 6) | (us[2] & 0x3F);
#ifdef DETECT_OVERLONG
        if (p[0] < 0x800) return -2;    /* would have fit in 2 bytes */
#endif
        return 3;
    }

    return -1;                          /* malformed, or a 4-byte sequence */
}
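For the part left as an exercise, here is a rough sketch of how the same bit-masking approach might be extended to 4-byte sequences. It is not the code from this answer: utf8_decode and its len parameter are names made up for illustration, the result goes into a uint32_t instead of a wchar_t so that code points above U+FFFF fit, and (unlike the function above) a NUL byte is decoded as an ordinary one-byte character.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the answer above): decode one UTF-8
 * sequence of at most len bytes from s into *cp.
 * Returns the number of bytes consumed (1..4), 0 if len is 0, or -1 for
 * a malformed, truncated, overlong, surrogate or out-of-range sequence.
 */
int utf8_decode(uint32_t *cp, const char *s, size_t len)
{
    const unsigned char *us = (const unsigned char *)s;
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int n, i;

    if (len == 0)
        return 0;

    /* The lead byte gives the sequence length and the top payload bits. */
    if (us[0] < 0x80)                { c = us[0];        n = 1; }
    else if ((us[0] & 0xE0) == 0xC0) { c = us[0] & 0x1F; n = 2; }
    else if ((us[0] & 0xF0) == 0xE0) { c = us[0] & 0x0F; n = 3; }
    else if ((us[0] & 0xF8) == 0xF0) { c = us[0] & 0x07; n = 4; }
    else
        return -1;              /* stray continuation or invalid lead byte */

    if ((size_t)n > len)
        return -1;              /* truncated sequence */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (i = 1; i < n; i++) {
        if ((us[i] & 0xC0) != 0x80)
            return -1;
        c = (c << 6) | (us[i] & 0x3F);
    }

    if (c < min_cp[n])
        return -1;              /* overlong encoding */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return -1;              /* beyond Unicode, or a UTF-16 surrogate */

    *cp = c;
    return n;
}
```

With this sketch, utf8_decode(&cp, "\303\210", 2) returns 2 and leaves 0xC8 in cp, which matches the È example in the question.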
Upvotes: 0
Reputation: 107739
If your platform's wchar_t stores Unicode (if it's a 32-bit type, it probably does) and you have a UTF-8 locale, you can call mbrtowc (added in Amendment 1 to C90).
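/* needs <locale.h>, <stdio.h>, <string.h>, <wchar.h> */
mbstate_t state;
wchar_t wch;
char s[] = "\303\210";   /* È encoded as UTF-8 */
size_t n;
memset(&state, 0, sizeof(state));
setlocale(LC_CTYPE, "en_US.utf8"); /*error checking omitted*/
n = mbrtowc(&wch, s, strlen(s), &state);
/* (size_t)-1 means invalid input, (size_t)-2 an incomplete sequence */
if (n < (size_t)-2) printf("%lx\n", (unsigned long)wch);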
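If you need every character of a string and not only the first, the same call can be repeated in a loop. A minimal sketch, assuming the caller has already selected a UTF-8 locale with setlocale and that wchar_t holds Unicode code points; print_code_points is a name invented for this example.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: print each character of the multibyte string s
 * as a hexadecimal code point, one per line.
 * Returns the number of characters decoded, or -1 if mbrtowc reports an
 * invalid or incomplete sequence.
 */
long print_code_points(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);
    long count = 0;

    memset(&state, 0, sizeof state);
    while (left > 0) {
        wchar_t wch;
        size_t n = mbrtowc(&wch, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;              /* decoded a NUL; nothing more to read */

        printf("%04lx\n", (unsigned long)wch);
        s += n;                 /* advance past the bytes just consumed */
        left -= n;
        count++;
    }
    return count;
}
```

The mbstate_t is carried across calls so that mbrtowc can track any shift state between characters; for UTF-8 there is none, but keeping it makes the loop work for other multibyte encodings as well.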
For more flexibility, you can call the iconv interface.
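/* needs <iconv.h>, <stdint.h>, <stdio.h>, <string.h> */
char s[] = "\303\210";
/* iconv_open takes the target encoding first, then the source encoding.
   Encoding names are implementation-defined; "UTF-32BE" (accepted by glibc
   and GNU libiconv) pins the byte order of the output. */
iconv_t cd = iconv_open("UTF-32BE", "UTF-8");
if (cd != (iconv_t)-1) {
    char *inp = s;
    size_t ins = strlen(s);
    unsigned char out[4];               /* room for one code point */
    char *outp = (char *)out;
    size_t outs = sizeof out;
    if (iconv(cd, &inp, &ins, &outp, &outs) != (size_t)-1) {
        /* assemble the big-endian bytes into a native integer */
        uint32_t c = (uint32_t)out[0] << 24 | (uint32_t)out[1] << 16
                   | (uint32_t)out[2] << 8 | out[3];
        printf("%lx\n", (unsigned long)c);
    }
    iconv_close(cd);
}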
Upvotes: 4
Reputation: 2902
Some things to look at:
Upvotes: 2