Reputation: 1379
I'm working on a project in which case-sensitive operations need to be replaced with case-insensitive ones. After doing some reading on this, the types of data to be considered are:
Please let me know if I've missed anything in the list.
Do the above need to be handled separately or are there libraries for C++ which can handle them all without concerning the type of data?
Specifically:
Does the boost library provide support for this? If so, are there sample examples or documentation on how to use the APIs?
I learned about IBM's International Components of Unicode (ICU). Is this a library that provides support for case insensitive operations? If so, are there sample examples or documentation on how to use the APIs?
Finally, which among the aforementioned (and other) approaches is better and why?
Thanks!
Based on the comments and answers, I wrote a sample program to understand this better:
#include <iostream> // std::cout
#include <string> // std::string
#include <locale> // std::locale, std::tolower
using namespace std;
void ascii_to_lower(string& str)
{
std::locale loc;
std::cout << "Ascii string: " << str;
std::cout << "Lower case: ";
for (std::string::size_type i=0; i<str.length(); ++i)
std::cout << std::tolower(str[i],loc);
return;
}
void non_ascii_to_lower(void)
{
std::locale::global(std::locale("en_US.UTF-8"));
std::wcout.imbue(std::locale());
const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
std::wstring str = L"Zoë Saldaña played in La maldición del padre Cardona.";
std::wcout << endl << "Non-Ascii string: " << str << endl;
f.tolower(&str[0], &str[0] + str.size());
std::wcout << "Lower case: " << str << endl;
return;
}
void non_ascii_to_upper(void)
{
std::locale::global(std::locale("en_US.UTF-8"));
std::wcout.imbue(std::locale());
const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
std::wstring str = L"¥£ªÄë";
std::wcout << endl << "Non-Ascii string: " << str << endl;
f.toupper(&str[0], &str[0] + str.size());
std::wcout << "Upper case: " << str << endl;
return;
}
int main ()
{
string str="Test String.\n";
ascii_to_lower(str);
non_ascii_to_upper();
non_ascii_to_lower();
return 0;
}
The output is:
Ascii string: Test String. Lower case: test string.
Non-Ascii string: ▒▒▒▒▒ Upper case: ▒▒▒▒▒
Non-Ascii string: Zo▒ Salda▒a played in La maldici▒n del padre Cardona. Lower case: zo▒ salda▒a played in la maldici▒n del padre cardona.
Although the non-ASCII strings seem to get converted to upper and lower case, some of the text is not visible in the output. Why is this?
On the whole, does the sample code look ok?
Upvotes: 2
Views: 1362
Reputation: 73376
You already have a very good answer about Boost. Here are some additional remarks:
Character encoding
ASCII characters are encoded on 7 bits. ISO 8859-1 and Windows-1252 extend ASCII with a limited set of international characters by making use of the 8th bit.
The Unicode standard extends ASCII much further: its code points range up to U+10FFFF, so any of them fits in 32 bits. Several encodings are available: UTF-32 is the easiest (1 Unicode code point = 1 32-bit code unit), while UTF-16 and UTF-8 store Unicode text in a variable-length encoding built from smaller code units.
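To make the variable-length point concrete, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes. The helper name utf8_code_points is made up for illustration, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded string.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point. Assumes `s` holds valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For example, "Zo\xC3\xAB" (the UTF-8 bytes of "Zoë") has size() == 4 but contains 3 code points, which is why indexing a UTF-8 std::string byte by byte does not step through characters.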
To make it even more difficult, different operating systems use different conventions. On Linux, wchar_t is in general a 32-bit wide character used for Unicode, wstring is a string based on wchar_t, and char strings generally use the UTF-8 encoding. On Windows, wchar_t is defined as 16 bits, because Windows' native encoding is UTF-16 (originally UCS-2, which covers only a subset of Unicode), and char is generally interpreted in the active code page, such as Windows-1252.
Dealing with character size and encoding
So, to come back to your problem, there are two aspects to consider:
the storage - If you want one size that fits all, you could use char32_t, which can hold ASCII as well as any Unicode character, and use a basic_string<char32_t> or u32string for strings, which supports all the functions you are used to with normal strings. Or you could use normal strings and adhere to UTF-8 everywhere.
the encoding - how your app interprets the value contained in your char in order to perform operations such as converting to lower or upper case. This is defined in the applicable locale.
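As a rough illustration of how the locale drives the conversion, the sketch below lower-cases a wide string with the std::ctype<wchar_t> facet of whatever locale is passed in. The helper name lower_with_locale is hypothetical, and which non-ASCII characters actually get converted depends on the locales installed on the system.

```cpp
#include <locale>
#include <string>

// Returns a lower-cased copy of `text`, using the case rules of `loc`.
// `lower_with_locale` is a hypothetical helper, not a standard function.
std::wstring lower_with_locale(std::wstring text, const std::locale& loc)
{
    if (text.empty())
        return text;
    // The ctype facet of the locale knows how to map each wchar_t.
    const std::ctype<wchar_t>& facet = std::use_facet<std::ctype<wchar_t> >(loc);
    facet.tolower(&text[0], &text[0] + text.size());
    return text;
}
```

With std::locale::classic() only the ASCII letters are guaranteed to change; passing something like std::locale("en_US.UTF-8") (if that locale exists on the machine) extends the mapping to accented letters.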
Fortunately, the C++ standard library can cope with all these aspects:
character classification functions (isupper(), isalpha(), ...) using the appropriate encoding
Additional libraries
The ICU library does provide case-insensitive comparison through Unicode case folding (for example, UnicodeString::caseCompare()), in addition to broader text-processing support such as iterating through text elements and collation ordering.
Unless you need that level of Unicode support, I'd suggest sticking with the standard library or Boost, due to the wide support these enjoy.
Upvotes: 1
Reputation: 10911
I'm a little surprised by this question. A simple search for boost case conversion came up with, as the first entry, Usage - 1.41.0 - Boost, which has an entry on case conversion.
A search for stl case conversion turns up tolower - C++ Reference - Cplusplus.com, which also shows how to convert using the STL.
To do a case insensitive search, convert both to lower or upper case and compare.
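A minimal sketch of that convert-and-compare idea, assuming single-byte (ASCII-style) text and the current C locale. The function name iequals_simple is made up for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Compares two strings while ignoring case, by lower-casing each pair
// of characters on the fly. Only suitable for single-byte encodings.
bool iequals_simple(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    // Taking the characters as unsigned char avoids undefined behaviour
    // when std::tolower receives a negative char value.
    return std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}
```

Comparing character by character avoids building two converted copies of the strings; iequals_simple("HeLlO WoRld!", "hello world!") returns true.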
Example code from boost.org:
string str1("HeLlO WoRld!");
to_upper(str1); // str1=="HELLO WORLD!"
Example from Cplusplus.com:
// tolower example (C++)
#include <iostream> // std::cout
#include <string> // std::string
#include <locale> // std::locale, std::tolower
int main ()
{
std::locale loc;
std::string str="Test String.\n";
for (std::string::size_type i=0; i<str.length(); ++i)
std::cout << std::tolower(str[i],loc);
return 0;
}
For ASCII characters (characters with an ASCII value < 128), there should be no problem. If you are using MBCS, you may need to use locales for code pages. Unicode should have no problems AFAIK.
As to Matt Jordan's comment:
The real issue with this request is that many languages have contextual requirements for case conversion - e.g. capital sigma 0x3A3 in Greek should become either 0x03C3 or 0x03C2, depending on whether it is at the end of a word or not.
I would be pleasantly surprised if the Boost library supported this. You would have to test it and report a bug if it doesn't. There's no reference on their page to say whether they do any contextual case conversions. A workaround might be to test both ways: convert to lowercase and compare, and convert to uppercase and compare. If either comparison is true, then there's a match, which should work for 99.99% of the cases.
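The workaround could be sketched like this, using the standard per-character std::tolower and std::toupper overloads from <locale>. The names map_case and iequals_either are hypothetical, and whether non-ASCII letters such as the Greek sigmas are mapped at all depends on the locale passed in.

```cpp
#include <locale>
#include <string>

// Converts every character with the simple one-to-one mapping of `loc`.
// Hypothetical helper; no contextual rules are applied.
std::wstring map_case(std::wstring text, const std::locale& loc, bool to_upper)
{
    for (wchar_t& ch : text)
        ch = to_upper ? std::toupper(ch, loc) : std::tolower(ch, loc);
    return text;
}

// Reports a match if the strings agree after lower-casing OR after
// upper-casing, so forms that differ in only one direction still match.
bool iequals_either(const std::wstring& a, const std::wstring& b,
                    const std::locale& loc = std::locale())
{
    return map_case(a, loc, false) == map_case(b, loc, false)
        || map_case(a, loc, true) == map_case(b, loc, true);
}
```

The upper-case pass is what helps with the sigma example: in a locale that knows Greek, both lowercase forms are expected to map to the same capital letter, even though they stay distinct when lower-cased.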
An interesting paper by Bjarne Stroustrup, found here, is a good read regarding Locales.
Upvotes: 2