daivid

Reputation: 11

C++ string manipulation within a UTF-8 locale

I want to do some simple string manipulation on a utf8 text file. It will mean taking substrings from a line and outputting them rearranged.

Since my Linux machine uses a UTF-8 locale and I don't intend to run the program elsewhere, setting the locale to UTF-8 seemed the way to go. Adapting an example, I arrived at the test program below. If you give it a Greek word, it echoes the word correctly, but printing the result of substr just produces garbage. Is there another function I can use, or is relying on a UTF-8 locale the wrong approach altogether?

    #include <clocale>   // setlocale
    #include <string>
    #include <iostream>

    int main()
    {
        std::string newwd;
        setlocale(LC_ALL, "");
        std::cout << "Enter greek word ";
        std::string wordgr;
        std::getline(std::cin, wordgr);
        std::cout << "The word is " << wordgr << "." << std::endl;
        newwd=wordgr.substr(2,1) ;
        std::cout << "3rd letter is " << wordgr.substr(2,1) << " <" << std::endl;
        return 0;
    } 

Upvotes: 0

Views: 2164

Answers (3)

n. m. could be an AI

Reputation: 119877

This works as expected on my system and on IDEOne.

#include <clocale>   // setlocale
#include <string>
#include <iostream>

int main()
{
    std::wstring newwd;
    setlocale(LC_ALL, "");
    std::wcout << "Enter greek word ";
    std::wstring wordgr;
    std::getline(std::wcin, wordgr);
    std::wcout << "The word is " << wordgr << "." << std::endl;
    newwd=wordgr.substr(2,1) ;
    std::wcout << "3rd letter is " << wordgr.substr(2,1) << " <" << std::endl;
    return 0;
}

Upvotes: 1

vershov

Reputation: 928

If you want to work with UTF-8 in your application, consider a dedicated library such as utf8-cpp. The member functions of std::string and std::wstring are not sufficient on their own, because they index fixed-size code units while a UTF-8 character has variable length; see the Wikipedia article on UTF-8 for more information.

Here is sample code that demonstrates the idea.

#include <string>
#include <iostream>
#include "source/utf8.h" // path to the utf8-cpp library header

int main()
{
        setlocale(LC_ALL, "");
        std::cout << "Enter greek word ";
        std::string wordgr;
        std::getline(std::cin, wordgr);
        std::cout << "The word is " << wordgr << "." << std::endl;
        std::string::iterator end_it = utf8::find_invalid(wordgr.begin(), wordgr.end());
        if (end_it != wordgr.end()) {
                std::cout << "Invalid utf-8 encoding" << std::endl;
                return 0;
        }
        // utf-8 string length
        std::cout << "Length is " << utf8::distance(wordgr.begin(), end_it) << std::endl;

        // utf-8 string symbol traverse
        std::string::iterator curr_it = wordgr.begin();
        while (curr_it != wordgr.end()) {
                // advance a copy past one code point; safe for empty input
                std::string::iterator next_it = curr_it;
                utf8::next(next_it, wordgr.end());
                std::cout << std::string(curr_it, next_it) << " - ";
                curr_it = next_it;
        }
        return 0;
}

The output is as follows:

./a.out 
Enter greek word Вова
The word is Вова.
Length is 4
В - о - в - а -

Upvotes: 0

Ken P

Reputation: 576

UTF-8 is a variable-length encoding: a character takes between one and four bytes (the original specification allowed up to six, but RFC 3629 restricts it to four). This causes substr(), which operates on bytes rather than characters, to produce unexpected results. Greek characters in UTF-8 are NOT one-byte characters. If you read a 4-character Greek word and call length() on the std::string, you get a value greater than 4 (most likely 8, since the basic Greek letters take two bytes each).
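
To make the byte-versus-character distinction concrete, here is a minimal sketch that counts and slices by code point using only the standard library. It assumes the input is already valid UTF-8 and does no validation. The names is_lead_byte, utf8_length and utf8_substr are hypothetical helpers invented for this illustration; they are not part of the standard library or of utf8-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if b starts a code point, i.e. it is not
// a UTF-8 continuation byte of the form 10xxxxxx.
inline bool is_lead_byte(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// Hypothetical helper: number of code points in s.
// Assumption: s holds valid UTF-8.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (is_lead_byte(b)) ++n;
    }
    return n;
}

// Hypothetical helper: like std::string::substr, but pos and count are
// measured in code points instead of bytes.
// Assumption: s holds valid UTF-8. Returns "" if pos is past the end.
inline std::string utf8_substr(const std::string& s,
                               std::size_t pos, std::size_t count) {
    std::size_t seen = 0;                     // code points skipped so far
    std::size_t taken = 0;                    // code points included so far
    std::size_t first = std::string::npos;    // byte offset of the slice
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        if (!is_lead_byte(static_cast<unsigned char>(s[i]))) continue;
        if (first == std::string::npos) {
            if (seen != pos) { ++seen; continue; }
            first = i;                        // reached the pos-th code point
        }
        if (taken == count) break;            // i is one past the slice
        ++taken;
    }
    if (first == std::string::npos) return std::string();
    assert(first <= i);                       // slice start never passes its end
    return s.substr(first, i - first);
}
```

With the five-letter word λόγος stored as UTF-8, size() reports 10, utf8_length reports 5, and utf8_substr(word, 2, 1) yields γ, whereas word.substr(2, 1) yields only the first byte of ό, which is not valid UTF-8 on its own. Note that this counts code points, not user-perceived characters: a letter written as a base character plus a combining accent still counts as two.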

Upvotes: 2
