Luka

Reputation: 1801

Get number of characters in string?

I have an application that accepts a UTF-8 string of at most 255 characters.

If all the characters are ASCII, the number of characters equals the size in bytes.

If the string is not all ASCII and contains Japanese characters, for example, how can I get the number of characters in the string, given its size in bytes?

Input: char *data, int bytes_no
Output: int char_no

Upvotes: 0

Views: 1044

Answers (2)

n. m. could be an AI

Reputation: 120079

There's no such thing as "character".

Or, more precisely, what "character" is depends on whom you ask.

If you look in the Unicode glossary, you will find that the term has several meanings that are not fully compatible. As the smallest component of written language that has semantic value (the first meaning), a letter with a diacritic is a single character. If you instead count the basic units of encoding for the Unicode character encoding (the third meaning) in that same letter, you may get either one or two, depending on which representation (normalized or denormalized) is being used.

Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.

Coming down to earth, you probably need to count code points, which are essentially the same as characters in meaning 3. mblen is one way of doing that, provided your current locale uses UTF-8 encoding. Modern C++ offers more C++-ish methods; however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU, which you may want to consider if your needs are much more complicated than counting characters.
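As a rough illustration of the mblen approach, here is a minimal sketch in C. The function names use_utf8_locale and count_code_points_mblen are hypothetical, and the locale names it tries are assumptions about what might be installed on a given system. It counts code points only, so a decomposed accented letter still counts as two.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>   /* mblen */

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one could be selected, 0 if none is installed. */
int use_utf8_locale(void)
{
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "C.utf8", "en_US.utf8"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Count the multibyte characters (code points, in a UTF-8 locale) in
   data[0..bytes_no). Returns -1 on an invalid or truncated sequence.
   Assumes the caller has already selected a UTF-8 locale. */
int count_code_points_mblen(const char *data, int bytes_no)
{
    int char_no = 0;
    int pos = 0;

    (void)mblen(NULL, 0);  /* reset mblen's internal conversion state */
    while (pos < bytes_no) {
        int len = mblen(data + pos, (size_t)(bytes_no - pos));
        if (len < 0)
            return -1;     /* not a valid multibyte character */
        if (len == 0)
            len = 1;       /* embedded NUL byte: count it as one character */
        pos += len;
        ++char_no;
    }
    return char_no;
}
```

A caller would check that use_utf8_locale() succeeds once at startup and then pass the question's data and bytes_no to count_code_points_mblen. If no UTF-8 locale is available, the result is not meaningful.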

Upvotes: 4

phschoen

Reputation: 2081

You can use mblen to count the length, or use mbstowcs.

Sources:

http://www.cplusplus.com/reference/cstdlib/mblen/

http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod

The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
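The hard-wired technique from the quoted passage could be sketched as follows in C. The function name count_utf8_code_points is hypothetical, and the sketch assumes the input is already valid UTF-8; it does no validation and needs no locale.

```c
/* Count code points in a UTF-8 buffer by counting every byte that is
   NOT a continuation byte. Continuation bytes lie in 0x80..0xBF,
   i.e. they have the bit pattern 10xxxxxx.
   Assumption: data[0..bytes_no) is valid UTF-8. */
int count_utf8_code_points(const char *data, int bytes_no)
{
    int char_no = 0;

    for (int i = 0; i < bytes_no; ++i) {
        unsigned char byte = (unsigned char)data[i];
        if ((byte & 0xC0) != 0x80)   /* lead byte or ASCII byte */
            ++char_no;
    }
    return char_no;
}
```

This matches the signature asked for in the question (char *data, int bytes_no in, int char_no out). For malformed input the count is not meaningful, so validate the data first if it comes from an untrusted source.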

You can store a Unicode character in a wide character (wchar_t), provided wchar_t is wide enough on your platform; on Windows it is only 16 bits.

Upvotes: 5
