Reputation: 149
I am determining the length of certain strings of characters in C++ with the function length(), but noticed something strange: say I define in the main function

string str;
str = "canción";

Then, when I calculate the length of str with str.length(), I get 8 as output. If instead I define str = "cancion" and calculate str's length again, the output is 7. In other words, the accent on the letter 'o' is altering the real length of the string. The same thing happens with other accents. For example, if str = "für", it will tell me its length is 4 instead of 3.
I would like to know how to ignore these accented characters when determining the length of a string; however, I wouldn't want to ignore isolated characters like '. For example, if str = "livin'", the length of str must be 6.
Upvotes: 4
Views: 975
Reputation: 29962
It is a difficult subject. Your string is likely UTF-8 encoded, and str.length() counts bytes. An ASCII character is encoded in 1 byte, but characters with code points above 127 are encoded in more than 1 byte.
Counting Unicode code points may not give you the answer you need either. Instead, you need to take into account the width of each code point, to handle combining accents and double-width code points (and there may be other cases as well). So it is difficult to do this properly without using a library.
You may want to check out ICU.
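As a minimal sketch of what that looks like (assuming ICU is installed and the program is linked with -licuuc), countChar32() counts code points rather than bytes:

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    std::string utf8 = "canción";
    // Convert the UTF-8 bytes to ICU's internal UTF-16 representation.
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(utf8);
    // countChar32() counts code points, not bytes or UTF-16 code units.
    std::cout << ustr.countChar32() << "\n";  // prints 7
    return 0;
}

Note that this still only counts code points; handling combining accents and display width properly needs more of ICU's API than this.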
If you have a constrained case and you don't want to use a library for this, you may want to read up on the UTF-8 encoding (it is not difficult) and write a simple UTF-8 code point counter (a simple algorithm is to count the bytes where (b&0xc0)!=0x80).
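A minimal sketch of such a counter (assuming the input is valid UTF-8; continuation bytes have the form 10xxxxxx, so the bytes where (b&0xc0)==0x80 are skipped):

#include <iostream>
#include <string>

// Count UTF-8 code points by counting every byte that is not a continuation byte.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if ((b & 0xc0) != 0x80)  // this byte starts a new code point
            ++count;
    }
    return count;
}

int main() {
    std::cout << utf8_length("canción") << "\n";  // prints 7
    std::cout << utf8_length("livin'") << "\n";   // prints 6
    return 0;
}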
Upvotes: 3
Reputation: 2566
Sounds like UTF-8 encoding. Since accented characters cannot be stored in a single byte, they are stored in 2 bytes each. See https://en.wikipedia.org/wiki/UTF-8
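A quick way to see this (a sketch, assuming your source and terminal encoding are UTF-8) is to dump the bytes of such a string:

#include <cstdio>
#include <string>

int main() {
    std::string s = "ó";                   // a single accented character
    std::printf("%zu bytes:", s.size());   // prints "2 bytes:" under UTF-8
    for (unsigned char b : s)
        std::printf(" %02x", b);           // prints " c3 b3", the UTF-8 encoding of U+00F3
    std::printf("\n");
    return 0;
}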
Upvotes: 0