Reputation: 175
I'm currently trying to use libunistring in my C program.
I have to process UTF-8 strings, and for that I used the u8_strlen() function from the libunistring library.
Code example:
void print_length(uint8_t *msg) {
    /* Both functions return size_t, so the format specifier is %zu, not %d. */
    printf("Default strlen: %zu\n", strlen((char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}
Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded).
I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just 6 letters).
But I received curious results:
Default strlen: 12
U8 strlen: 12
After this I looked up the implementation of u8_strlen and found this code:
size_t
u8_strlen (const uint8_t *s)
{
  return strlen ((const char *) s);
}
I'm wondering: is this a bug, or is it the correct answer? If it is correct, why?
Upvotes: 5
Views: 1151
Reputation: 3154
There is also the u8_mbsnlen function:
Function: size_t u8_mbsnlen (const uint8_t *s, size_t n)
Counts and returns the number of Unicode characters in the n units from s.
This function is similar to the gnulib function mbsnlen, except that it operates on Unicode strings.
(link)
Unfortunately this needs you to pass in the length of the string in bytes as well.
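With libunistring itself, the byte length can come from u8_strlen, so a call such as u8_mbsnlen(msg, u8_strlen(msg)) should give the character count. To show what "counting characters in n units" means without depending on the library, here is a minimal sketch. The names count_code_points and count_code_points_str are hypothetical helpers, not libunistring functions, and the code assumes the input is valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of libunistring.
   Counts code points in the first n bytes of s, assuming valid UTF-8.
   Each code point contains exactly one byte that is not a continuation
   byte (bit pattern 10xxxxxx), so counting those bytes counts code points. */
static size_t count_code_points(const uint8_t *s, size_t n)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Convenience wrapper for NUL-terminated strings: the byte length that the
   counting function needs is obtained with plain strlen. */
static size_t count_code_points_str(const uint8_t *s)
{
    assert(s != NULL);
    return count_code_points(s, strlen((const char *)s));
}
```

For the string from the question, count_code_points_str would report 6 while strlen reports 12. Note that this counts code points, not user-perceived characters, so combining sequences still count as several.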
Upvotes: 0
Reputation: 151
I believe this is the intended behavior.
The libunistring manual says that:
size_t u8_strlen (const uint8_t *s)
Returns the number of units in s.
Also in the manual, it defines what this "unit" is:
UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).
I believe the reason the function is named u8_strlen, even though it does nothing more than the standard strlen, is that the library also has u16_strlen and u32_strlen for UTF-16 and UTF-32 strings, respectively (these count 2-byte units until 0x0000 and 4-byte units until 0x00000000), and u8_strlen was included simply for completeness.
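To illustrate what counting "units" means for the wider encodings, here is a rough sketch. The functions count_units16 and count_units32 are hypothetical stand-ins written for this example, not the library's actual source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a UTF-16 unit counter: counts 16-bit units
   up to, but not including, the terminating 0x0000 unit. */
static size_t count_units16(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical stand-in for a UTF-32 unit counter: counts 32-bit units
   up to, but not including, the terminating 0x00000000 unit. */
static size_t count_units32(const uint32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

As with UTF-8, the unit count and the character count can differ in UTF-16: a code point outside the Basic Multilingual Plane, such as U+1F600, takes two 16-bit units (a surrogate pair) but only one 32-bit unit.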
GNU gnulib does, however, include mbslen, which probably does what you want:
mbslen function: Determine the number of multibyte characters in a string.
Upvotes: 7