Artem Agasiev

Reputation: 175

Is libunistring's u8_strlen() equivalent to strlen()?

I'm trying to use libunistring in my C program. I have to process UTF-8 strings, and for that I used the u8_strlen() function from libunistring.
Code example:

#include <stdio.h>
#include <string.h>
#include <unistr.h>

void print_length(const uint8_t *msg) {
    printf("Default strlen: %zu\n", strlen((const char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}

Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded). I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just 6 letters).

But I received curious results:

Default strlen: 12
U8 strlen: 12

After this I looked up the implementation of u8_strlen and found this code:

size_t
u8_strlen (const uint8_t *s)
{
    return strlen ((const char *) s);
}

Is this a bug, or is it the correct behavior? If it is correct, why?

Upvotes: 5

Views: 1151

Answers (2)

Gavin Smith

Reputation: 3154

There is also the u8_mbsnlen function:

Function: size_t u8_mbsnlen (const uint8_t *s, size_t n)

Counts and returns the number of Unicode characters in the n units from s.

This function is similar to the gnulib function mbsnlen, except that it operates on Unicode strings.


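Unfortunately, this requires you to pass in the length of the string in bytes as well. For a NUL-terminated string you can get that length from u8_strlen (or strlen), for example u8_mbsnlen(s, u8_strlen(s)).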

Upvotes: 0

Berry

Reputation: 151

I believe this is the intended behavior.

The libunistring manual says that:

size_t u8_strlen (const uint8_t *s)

Returns the number of units in s.

The manual also defines what this "unit" is:

UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).

I believe the function is named u8_strlen, even though it does nothing more than the standard strlen, because the library also has u16_strlen and u32_strlen for UTF-16 and UTF-32 strings, respectively. Those count 2-byte units until 0x0000 and 4-byte units until 0x00000000. u8_strlen was included simply for completeness.

GNU gnulib does, however, include mbslen, which probably does what you want:
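To make the "units" idea concrete, here is a minimal sketch of unit counting for the three encodings. The functions units8, units16 and units32 and the arrays hello8, hello16 and hello32 are hypothetical names written for this illustration. They are not libunistring's source, only stand-ins for what the manual describes u8_strlen, u16_strlen and u32_strlen as returning.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins (not libunistring code): each one counts
   code units up to, but not including, the first zero unit. */
size_t units8(const uint8_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

size_t units16(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

size_t units32(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* "привет" written out unit by unit in each encoding. */
const uint8_t hello8[] = {
    0xD0, 0xBF, 0xD1, 0x80, 0xD0, 0xB8,
    0xD0, 0xB2, 0xD0, 0xB5, 0xD1, 0x82, 0
};
const uint16_t hello16[] = {
    0x043F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442, 0
};
const uint32_t hello32[] = {
    0x043F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442, 0
};
```

With these definitions, units8(hello8) is 12 while units16(hello16) and units32(hello32) are both 6. The same word has a different unit count depending on the encoding, which is why a unit count is not a character count.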

mbslen function: Determine the number of multibyte characters in a string.
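As far as I know, mbslen works on the multibyte encoding of the current locale, so it only counts UTF-8 characters when the locale is a UTF-8 one. If you only need a count for UTF-8 and can assume the input is valid, a locale-independent alternative is to count every byte that is not a continuation byte. This is a rough sketch of that different technique, not what mbslen does internally, and utf8_count_code_points is a hypothetical helper name, not part of libunistring or gnulib.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a NUL-terminated string,
   ASSUMING the input is valid UTF-8. Continuation bytes have the bit
   pattern 10xxxxxx; every other byte starts a new code point. */
size_t utf8_count_code_points(const uint8_t *s)
{
    size_t count = 0;
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the 12-byte UTF-8 encoding of "привет" this returns 6. Note that it counts code points, so a letter followed by a combining mark counts as two.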

Upvotes: 7
