Reputation: 11
I want to do some simple string manipulation on a UTF-8 text file. It will mean taking substrings from a line and outputting them rearranged.
As my Linux computer has a UTF-8 locale and I don't intend to run the program elsewhere, setting the locale to UTF-8 seemed to be the way to go. Adapting an example, I arrived at the test program below. If you give it a Greek word, it echoes the word correctly, but outputting the result of substr just produces garbage. Is there another function I can use, or is relying on a UTF-8 locale totally the wrong way to go?
#include <string>
#include <iostream>
int main()
{
    std::string newwd;
    setlocale(LC_ALL, "");
    std::cout << "Enter greek word ";
    std::string wordgr;
    std::getline(std::cin, wordgr);
    std::cout << "The word is " << wordgr << "." << std::endl;
    newwd=wordgr.substr(2,1) ;
    std::cout << "3rd letter is " << wordgr.substr(2,1) << " <" << std::endl;
    return 0;
}
Upvotes: 0
Views: 2164
Reputation: 119877
This works as expected on my system and on IDEOne.
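#include <clocale>
#include <iostream>
#include <string>
int main()
{
    // Pick up the environment's locale (e.g. a UTF-8 one) so the wide streams
    // can convert between UTF-8 bytes and wide characters.
    std::setlocale(LC_ALL, "");
    std::wcout << "Enter greek word ";
    std::wstring wordgr;
    std::getline(std::wcin, wordgr);
    std::wcout << "The word is " << wordgr << "." << std::endl;
    // On Linux wchar_t is 32 bits wide, so each element is a whole code point
    // and substr(2,1) yields the third letter.
    std::wcout << "3rd letter is " << wordgr.substr(2,1) << " <" << std::endl;
    return 0;
}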
Upvotes: 1
Reputation: 928
If you want to work with UTF-8 in your application, consider a dedicated library such as utf8-cpp. The member functions of std::string and std::wstring are not enough on their own, since a UTF-8 character can occupy a variable number of bytes; see the wiki for more info.
Here is sample code that demonstrates the idea.
#include <clocale>
#include <iostream>
#include <string>
#include "source/utf8.h" // path to the utf8-cpp library header
int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "Enter greek word ";
    std::string wordgr;
    std::getline(std::cin, wordgr);
    std::cout << "The word is " << wordgr << "." << std::endl;
    // find_invalid returns an iterator to the first invalid byte,
    // or the end iterator if the whole string is valid UTF-8
    std::string::iterator end_it = utf8::find_invalid(wordgr.begin(), wordgr.end());
    if (end_it != wordgr.end()) {
        std::cout << "Invalid utf-8 encoding" << std::endl;
        return 1;
    }
    // utf-8 string length (in code points, not bytes)
    std::cout << "Length is " << utf8::distance(wordgr.begin(), end_it) << std::endl;
    // utf-8 string symbol traverse: [curr_it, next_it) spans one code point
    std::string::iterator curr_it = wordgr.begin();
    while (curr_it != wordgr.end()) {
        std::string::iterator next_it = curr_it;
        utf8::next(next_it, wordgr.end()); // advance by one code point
        std::cout << std::string(curr_it, next_it) << " - ";
        curr_it = next_it;
    }
    std::cout << std::endl;
    return 0;
}
The output is as follows:
./a.out
Enter greek word Вова
The word is Вова.
Length is 4
В - о - в - а -
Upvotes: 0
Reputation: 576
UTF-8 is a variable-length encoding; a given character in UTF-8 can be between one and four bytes long. This causes the substr() method, which operates on bytes rather than characters, to produce unexpected results. Greek characters in UTF-8 are NOT one-byte characters. If you input a 4-character Greek string and then called length() on that std::string, you would get a result greater than 4 (most likely 8, since the basic Greek letters take two bytes each).
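To make the byte-versus-character distinction concrete, here is a minimal sketch that uses only the standard library. The helper names (is_continuation, utf8_length, utf8_offset, utf8_substr) are hypothetical and not part of any library. It assumes the input is already valid UTF-8 and counts code points, so a letter typed as a base character plus a combining accent would count as two.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Number of code points in s, assuming s holds valid UTF-8:
// every byte that is not a continuation byte starts a new code point.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if (!is_continuation(byte)) ++count;
    }
    return count;
}

// Byte offset at which code point number `index` starts,
// or s.size() if the string has fewer code points than that.
inline std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (pos < s.size() && index > 0) {
        ++pos; // step over the lead byte
        while (pos < s.size() &&
               is_continuation(static_cast<unsigned char>(s[pos]))) {
            ++pos; // step over its continuation bytes
        }
        --index;
    }
    return pos;
}

// Like std::string::substr, but start and count are measured in code points.
// Out-of-range positions yield an empty result instead of throwing.
inline std::string utf8_substr(const std::string& s,
                               std::size_t start, std::size_t count) {
    const std::size_t begin = utf8_offset(s, start);
    const std::size_t end = utf8_offset(s, start + count);
    return s.substr(begin, end - begin);
}
```

For the word "λόγος" (with a precomposed ό), size() is 10 and utf8_length is 5. utf8_substr(word, 2, 1) gives "γ", whereas word.substr(2, 1) gives only the first byte of "ό", which is why the terminal shows garbage. Each call scans from the start of the string, so for heavy use an iterator-based library such as utf8-cpp is likely the better fit.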
Upvotes: 2