marw

Reputation: 413

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.

This here does not work for non-ASCII characters:

wchar_t currentLetter = word.at(0);  

The reason is that it returns two characters (in a loop) for non-ASCII characters such as German umlauts.

This here does not work, either:

wchar_t currentLetter = word.substr(0,1);

error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'

And neither does this:

wchar_t currentLetter = word.substr(0,1).c_str();

error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'

Any other ideas?

Cheers,

Martin

---- Update ----

Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:

#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>    // std::wstring
using namespace std;

int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;

    wchar_t currentLetter;
    bool isLastLetter;

    do {
        isLastLetter = ( word.length() == 1 );
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;

        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}

However, the actual output I get is:

f?r
? ? ?
Letter: f
Letter: ?
Letter: r

The source file is encoded in UTF-8 and the console's encoding is also set to UTF-8.

Upvotes: 0

Views: 4882

Answers (1)

user1508519

Here's a solution provided by Sehe:

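#include <algorithm>  // std::copy
#include <cstdlib>    // EXIT_SUCCESS
#include <iostream>
#include <iterator>   // std::back_inserter
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>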

using namespace std;

template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);

    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";

    bool isLastLetter;

    do {
        isLastLetter = ( word.length() == 1 );
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;

        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}

Output:

Letter: f
Letter: ü
Letter: r

Yes, you need Boost, but it seems that you're going to need an external library anyway.

1

C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class). Both support Unicode, including UTF-8.

2

Since UTF-8 is a variable-length encoding, all kinds of indexing operate on code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access, you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on string literals.
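As a minimal sketch of the fixed-length approach in point 2, assuming a C++11 compiler: in a std::u32string every element is one code point, so taking and removing the first one is plain indexing. The helper name pop_first is hypothetical and is not part of any answer quoted here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: removes and returns the first code point of a
// UTF-32 string. Each char32_t element holds exactly one code point.
char32_t pop_first(std::u32string& s)
{
    assert(!s.empty());          // precondition: there must be a first element
    char32_t first = s.front();  // one element == one code point in UTF-32
    s.erase(0, 1);               // drop that element from the string
    return first;
}
```

Called repeatedly on U"f\u00fcr", this yields 'f', U+00FC and 'r' in turn. Note that a code point is not always what a reader would call a letter (combining marks are separate code points), so this only covers precomposed characters such as 'ü'.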

3

The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.

To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms. On Windows, the WideCharToMultiByte function produces UTF-8 when called with the CP_UTF8 code page.

C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than through the way I described.
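To make the target of such a conversion concrete, here is a hand-written sketch of UTF-8 encoding for a single code point. It is not iconv() or WideCharToMultiByte, and it only applies to a wchar_t value under the assumption that the value already is a Unicode code point, which the standard does not guarantee. The function name encode_utf8 is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp <= 0x10FFFF; surrogate values are not rejected in this sketch.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 'ü' (U+00FC) this produces the two bytes 0xC3 0xBC, which is why byte-wise indexing into a UTF-8 string splits the letter.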

4 (about source files but still sorta relevant)

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set, which are listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters, called universal character names. They look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).

This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. This mapping constitutes the encoding used.
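Universal character names sidestep that implementation-defined mapping, because the source file itself stays pure ASCII. A small sketch, assuming C++11; the names kWord and code_point_at are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "für" spelled with a universal character name: 'ü' is U+00FC.
// No non-ASCII byte appears in the source file.
const std::u32string kWord = U"f\u00fcr";

// Hypothetical helper: returns the code point at position i of kWord.
char32_t code_point_at(std::size_t i)
{
    assert(i < kWord.size());  // guard against out-of-range access
    return kWord[i];
}
```

Here kWord has three elements and code_point_at(1) is 0x00FC, regardless of how the compiler interprets the bytes of the source file.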

Upvotes: 1
