Reputation: 1076
I am trying to use the boost::locale library (version 1.71) to perform uppercase and lowercase conversion of strings in my code.
I have an issue with the capitalization of "ß". To stay compliant with existing unit tests in my codebase, I want the letter "ß" to be capitalized to "SS". As far as I understand, this is the documented behavior (https://www.boost.org/doc/libs/1_71_0/libs/locale/doc/html/conversions.html).
Here is a copy of the example provided on this page for reference:
Upper GRÜSSEN
Lower grüßen
Title Grüßen
Fold grüssen
However, this is not what happens when I call the method in my code: the "ß" stays as "ß" after the uppercase conversion.
I was confused and found the following example in the boost::locale library source:
//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//
#include <boost/locale.hpp>
#include <boost/algorithm/string/case_conv.hpp>
#include <iostream>
#include <ctime>
int main()
{
    using namespace boost::locale;
    using namespace std;
    // Create system default locale
    generator gen;
    locale loc=gen("");
    locale::global(loc);
    cout.imbue(loc);

    cout<<"Correct case conversion can't be done by simple, character by character conversion"<<endl;
    cout<<"because case conversion is context sensitive and not 1-to-1 conversion"<<endl;
    cout<<"For example:"<<endl;
    cout<<" German grüßen correctly converted to "<<to_upper("grüßen")<<", instead of incorrect "
        <<boost::to_upper_copy(std::string("grüßen"))<<endl;
    cout<<" where ß is replaced with SS"<<endl;
    cout<<" Greek ὈΔΥΣΣΕΎΣ is correctly converted to "<<to_lower("ὈΔΥΣΣΕΎΣ")<<", instead of incorrect "
        <<boost::to_lower_copy(std::string("ὈΔΥΣΣΕΎΣ"))<<endl;
    cout<<" where Σ is converted to σ or to ς, according to position in the word"<<endl;
    cout<<"Such type of conversion just can't be done using std::toupper that work on character base, also std::toupper is "<<endl;
    cout<<"not even applicable when working with variable character length like in UTF-8 or UTF-16 limiting the correct "<<endl;
    cout<<"behavior to unicode subset BMP or ASCII only"<<endl;
}
// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
// boostinspect:noascii
I tried compiling it, and this is the result I get:
Correct case conversion can't be done by simple, character by character conversion
because case conversion is context sensitive and not 1-to-1 conversion
For example:
German grüßen correctly converted to GRÜßEN, instead of incorrect GRüßEN
where ß is replaced with SS
Greek ὈΔΥΣΣΕΎΣ is correctly converted to ὀδυσσεύσ, instead of incorrect ὈΔΥΣΣΕΎΣ
where Σ is converted to σ or to ς, according to position in the word
Such type of conversion just can't be done using std::toupper that work on character base, also std::toupper is
not even applicable when working with variable character length like in UTF-8 or UTF-16 limiting the correct
behavior to unicode subset BMP or ASCII only
Emphasis on the part:
German grüßen correctly converted to GRÜßEN, instead of incorrect GRüßEN
where ß is replaced with SS
I really don't get what is going on in this sentence. What is the actual expected behavior?
Upvotes: 2
Views: 230
Reputation: 1076
The documentation has been updated in the following commit: https://github.com/Flamefire/locale/commit/bae1f380ad0719121dfe048c56119bf72e074144
It now reads:
German grüßen would be incorrectly converted to GRÜßEN, while Boost.Locale converts it to GRÜSSEN where ß is replaced with SS.
So the expected behavior is indeed to capitalize "ß" as "SS".
I assume my code behaved differently because my build of Boost.Locale was not compiled with the ICU backend. I no longer have access to the original code, so I can't confirm this theory.
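If rebuilding with ICU is not an option, one possible workaround is to post-process the uppercased string and expand any remaining "ß" to "SS". The following is only a minimal sketch, not part of Boost.Locale: `expand_sharp_s` is a hypothetical helper name, and it assumes the string is UTF-8 encoded, where "ß" is the two-byte sequence 0xC3 0x9F.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost.Locale function): replaces every UTF-8
// encoded "ß" (bytes 0xC3 0x9F) in the input with the two ASCII letters "SS".
// Assumes the input is valid UTF-8; all other bytes are copied unchanged.
std::string expand_sharp_s(const std::string& in)
{
    const std::string sharp_s = "\xC3\x9F"; // UTF-8 encoding of U+00DF
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        if (in.compare(i, sharp_s.size(), sharp_s) == 0) {
            out += "SS";            // one character becomes two
            i += sharp_s.size();
        } else {
            out += in[i];
            ++i;
        }
    }
    return out;
}
```

It would be applied to the result of the uppercase call, for example `expand_sharp_s(boost::locale::to_upper(text, loc))`. With a backend that already produces "SS", the helper leaves the string unchanged.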
Upvotes: 1