Richard Topchii

Reputation: 8185

C++ std::string capitalize in non-latin language (without third-party libraries)

Considering the method:

void Capitalize(std::string &s)
{
    bool shouldCapitalize = true;

    for(size_t i = 0; i < s.size(); i++)
    {
        if (iswalpha(s[i]) && shouldCapitalize == true)
        {
            s[i] = (char)towupper(s[i]);
            shouldCapitalize = false;
        }
        else if (iswspace(s[i]))
        {
            shouldCapitalize = true;
        }
    }
}

It works perfectly for ASCII characters, e.g.

"steve" -> "Steve"

However, once I use non-Latin characters, e.g. the Cyrillic alphabet, I don't get that result:

"стив" -> "стив"

Why does that method fail for non-Latin alphabets? I've tried isalpha as well as iswalpha, but I get exactly the same result.

What would be a way to modify this method to capitalize non-latin alphabets?

Note: I'd prefer to solve this without a third-party library such as icu4c; with one, it would have been a very simple problem to solve.

Update:

This solution doesn't work (for some reason):

void Capitalize(std::string &s)
{
    bool shouldCapitalize = true;
    std::locale loc("ru_RU"); // Creating a locale that supports cyrillic alphabet

    for(size_t i = 0; i < s.size(); i++)
    {
        if (isalpha(s[i], loc) && shouldCapitalize == true)
        {
            s[i] = (char)toupper(s[i], loc);
            shouldCapitalize = false;
        }
        else if (isspace(s[i], loc))
        {
            shouldCapitalize = true;
        }
    }
}

Upvotes: 2

Views: 138

Answers (2)

sklott

Reputation: 2859

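std::locale does work, at least where the named locale is installed on the system, but you are using it incorrectly. A std::string stores the text as bytes, and (assuming your string is UTF-8 encoded) each Cyrillic letter takes two bytes, so isalpha and toupper are called on half a character at a time and never see a whole letter. Use std::wstring, so that each element holds one whole character.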

This code works as expected on Ubuntu with Russian locale installed:

#include <iostream>
#include <locale>
#include <string>
#include <codecvt>

void Capitalize(std::wstring &s)
{
    bool shouldCapitalize = true;
    std::locale loc("ru_RU.UTF-8"); // Creating a locale that supports cyrillic alphabet

    for(size_t i = 0; i < s.size(); i++)
    {
        if (isalpha(s[i], loc) && shouldCapitalize == true)
        {
            s[i] = toupper(s[i], loc);
            shouldCapitalize = false;
        }
        else if (isspace(s[i], loc))
        {
            shouldCapitalize = true;
        }
    }
}

int main()
{
    std::wstring in = L"это пример текста";
    Capitalize(in);
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string out = conv1.to_bytes(in);
    std::cout << out << "\n";
    return 0;
}

It's possible that on Windows you need to use a different locale name; I'm not sure.
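If you want to keep the original std::string signature, one option is to convert at the boundaries. Below is a minimal sketch, assuming the std::string holds UTF-8; Utf8ToWide and WideToUtf8 are names I made up for illustration. It relies on the same std::wstring_convert used above, which is deprecated since C++17, so treat it as a stopgap rather than a definitive approach.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into one wchar_t per character.
// Assumes the input is valid UTF-8; from_bytes throws std::range_error otherwise.
std::wstring Utf8ToWide(const std::string &utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: encode a wide string back into UTF-8 bytes.
std::string WideToUtf8(const std::wstring &wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```

With these, a std::string version could call Utf8ToWide, run the wide Capitalize above, and assign the result of WideToUtf8 back. It also shows why the byte-wise loop fails: the UTF-8 string "стив" has a size() of 8, while the converted std::wstring has a size() of 4.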

Upvotes: 2

TopchetoEU

Reputation: 694

Well, an external library would be the only practical choice IMHO. The standard functions work well with Latin, but any other locale would be a pain, and I wouldn't bother. Still, if you want support for Latin and Cyrillic without an external library, you can just write it yourself:

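#include <cwctype> // towupper

wchar_t to_upper(wchar_t c) {
    // Basic Latin
    if (c >= L'a' && c <= L'z') return c - L'a' + L'A';
    // Cyrillic а..я (contiguous in Unicode)
    if (c >= L'а' && c <= L'я') return c - L'а' + L'А';
    // ё lies outside the а..я range, so handle it separately
    if (c == L'ё') return L'Ё';

    // Anything else: fall back to the C library
    return towupper(c);
}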

Still, it's important to note that with this approach you have to painstakingly implement support for every alphabet yourself, and not even all Latin characters are covered, so an external library is the best solution. Consider this solution only if you're sure that only English and Russian are going to be used.
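For completeness, here is a rough sketch of how a hand-written mapping like this could be plugged into the word-capitalizing loop from the question. The names ToUpperLatinCyrillic, IsLatinOrCyrillicLetter and CapitalizeWords are mine, not from any library. The Cyrillic letters are written as \u escapes so that the code does not depend on the source file encoding, and ё is handled separately because it lies outside the а..я range. It only covers basic Latin and Russian Cyrillic.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercase basic Latin and Russian Cyrillic only.
wchar_t ToUpperLatinCyrillic(wchar_t c)
{
    if (c >= L'a' && c <= L'z')
        return static_cast<wchar_t>(c - L'a' + L'A');
    if (c >= L'\u0430' && c <= L'\u044F')   // а..я -> А..Я (0x20 apart)
        return static_cast<wchar_t>(c - 0x20);
    if (c == L'\u0451')                     // ё -> Ё
        return L'\u0401';
    return c;
}

// Hypothetical helper: true for the letters the mapping above knows about.
bool IsLatinOrCyrillicLetter(wchar_t c)
{
    return (c >= L'a' && c <= L'z') || (c >= L'A' && c <= L'Z') ||
           (c >= L'\u0410' && c <= L'\u044F') ||   // А..я
           c == L'\u0401' || c == L'\u0451';       // Ё, ё
}

// Capitalize the first letter of every whitespace-separated word.
void CapitalizeWords(std::wstring &s)
{
    bool shouldCapitalize = true;
    for (wchar_t &c : s)
    {
        if (shouldCapitalize && IsLatinOrCyrillicLetter(c))
        {
            c = ToUpperLatinCyrillic(c);
            shouldCapitalize = false;
        }
        else if (std::iswspace(static_cast<std::wint_t>(c)))
        {
            shouldCapitalize = true;
        }
    }
}
```

Because nothing here depends on std::locale, it behaves the same whether or not a Russian locale is installed, at the cost of knowing nothing about any other alphabet.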

Upvotes: 0
