Carl Rojas

Reputation: 149

How to ignore accents in a string so it does not alter its length?

I am determining the length of certain strings of characters in C++ with the function length(), but noticed something strange: say I define in the main function

string str;
str = "canción";

Then, when I calculate the length of str by str.length() I get as output 8. If instead I define str = "cancion" and calculate str's length again, the output is 7. In other words, the accent on the letter 'o' is altering the real length of the string. The same thing happens with other accents. For example, if str = "für" it will tell me its length is 4 instead of 3.

I would like to know how to ignore these accented characters when determining the length of a string; however, I wouldn't want to ignore isolated characters like '. For example, if str = livin', the length of str must be 6.

Upvotes: 4

Views: 975

Answers (2)

geza

Reputation: 29962

It is a difficult subject. Your string is likely UTF-8 encoded, and str.length() counts bytes, not characters. An ASCII character is encoded in 1 byte, but characters with code points above 127 are encoded in more than 1 byte.

Counting Unicode code points may not give you the answer you need. Instead, you need to take into account the width of each code point, to handle separate (combining) accents and double-width code points (and maybe there are other cases as well). So it is difficult to do this properly without using a library.

You may want to check out ICU.

If you have a constrained case and you don't want to use a library for this, you may want to check out UTF-8 encoding (it is not difficult), and create a simple UTF-8 code point counter (a simple algorithm could be to count bytes where (b&0xc0)!=0x80).
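The simple counter mentioned above could be sketched roughly as follows. This is only a minimal illustration, assuming the string holds valid UTF-8; the function name count_code_points is made up for this example, and malformed input is not detected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in `s`.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx),
// so counting the non-continuation bytes counts the code points.
// Assumption: `s` is valid UTF-8; invalid sequences are not checked.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {  // skip continuation bytes
            ++count;
        }
    }
    return count;
}
```

With this, count_code_points("canción") should give 7 and count_code_points("für") should give 3 when the source file is saved as UTF-8, while "livin'" still gives 6. Note the limitation described above: if the accent is stored as a separate combining code point (o followed by U+0301), the result for "canción" would be 8 again.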

Upvotes: 3

DBug

Reputation: 2566

Sounds like UTF-8 encoding. Since the accented characters cannot be stored in a single byte, they are stored in 2 bytes each. See https://en.wikipedia.org/wiki/UTF-8
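One way to see this for yourself is to dump the bytes of the string. The sketch below assumes the string is UTF-8 encoded; hex_bytes is a hypothetical helper written for this illustration, not a standard function.

```cpp
#include <string>

// Hypothetical helper: returns the bytes of `s` as space-separated,
// two-digit lowercase hex values, e.g. "c3 b3".
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : s) {
        if (!out.empty()) {
            out += ' ';           // separator between bytes
        }
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

For a UTF-8 string containing only "ó" this yields "c3 b3" (two bytes), whereas a plain "o" yields "6f" (one byte), which is where the extra unit in str.length() comes from.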

Upvotes: 0
