canellas

Reputation: 697

std::string::size() strange behaviour

I believe the output has to do with UTF-8, but I do not know how. Could someone please explain why size() returns 6 for the second string?

#include <iostream>
#include <cstdint>
#include <iomanip>
#include <string>

int main()
{

    std::cout << "sizeof(char) = " << sizeof(char) << std::endl;
    std::cout << "sizeof(std::string::value_type) = " << sizeof(std::string::value_type) << std::endl;

    std::string _s1 ("abcde");
    std::cout << "s1 = " << _s1 << ", _s1.size() = " << _s1.size() << std::endl;


    std::string _s2 ("abcdé");
    std::cout << "s2 = " << _s2 << ", _s2.size() = " << _s2.size() << std::endl;

    return 0;
}

The output is:

sizeof(char) = 1    
sizeof(std::string::value_type) = 1    
s1 = abcde, _s1.size() = 5    
s2 = abcdé, _s2.size() = 6

g++ --version prints g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609

Qt Creator compiles it like this:

g++ -c -m32 -pipe -g -std=c++0x -Wall -W -fPIC  -I../strsize -I. -I../../Qt/5.5/gcc/mkspecs/linux-g++-32 -o main.o ../strsize/main.cpp
g++ -m32 -Wl,-rpath,/home/rodrigo/Qt/5.5/gcc -o strsize main.o

Thanks a lot!

Upvotes: 1

Views: 116

Answers (3)

Sergey

Reputation: 8238

Even in C++11, std::string has nothing to do with UTF-8. The description of the size and length methods of std::string says:

For std::string, the elements are bytes (objects of type char), which are not the same as characters if a multibyte encoding such as UTF-8 is used.

Thus, you should use a third-party Unicode-aware library to handle Unicode strings.

If you keep using non-Unicode string classes with Unicode strings, you may face lots of other problems. For example, comparing a precomposed character with the same-looking combining-character sequence gives a bogus result: the two look identical on screen but compare as unequal, because their bytes differ.

Upvotes: 3

Remus Rusanu

Reputation: 294207

GCC's default input character set is UTF-8. Your editor probably also saved the file as UTF-8, so in your .cpp file the string abcdé occupies 6 bytes (as Peter already answered, LATIN SMALL LETTER E WITH ACUTE is encoded in UTF-8 as 2 bytes). std::string::length returns the length in bytes, i.e. 6. QED

You should open your source .cpp file in a hex editor to confirm.

Upvotes: 4

Peter Skarpetis

Reputation: 543

é is encoded as 2 bytes, 0xC3 0xA9, in UTF-8.

Upvotes: 4
