Maddy

Reputation: 1379

Case insensitive operations

I'm working on a project in which case-sensitive operations need to be replaced with case-insensitive ones. After some reading on this, the types of data to consider are:

  1. ASCII characters
  2. Non-ASCII characters
  3. Unicode characters

Please let me know if I've missed anything in the list.

Do the above need to be handled separately, or are there C++ libraries that can handle them all regardless of the type of data?

Specifically:

  1. Does the Boost library provide support for this? If so, are there examples or documentation on how to use the APIs?

  2. I learned about IBM's International Components for Unicode (ICU). Is this a library that provides support for case-insensitive operations? If so, are there examples or documentation on how to use the APIs?

Finally, which among the aforementioned (and other) approaches is better and why?

Thanks!

Based on the comments and answers, I wrote a sample program to understand this better:

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

using namespace std;

void ascii_to_lower(string& str)
{
     std::locale loc;
     std::cout << "Ascii string: " << str;
     std::cout << "Lower case: ";

     for (std::string::size_type i=0; i<str.length(); ++i)
         std::cout << std::tolower(str[i],loc);
     return;
}

void non_ascii_to_lower(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"Zoë Saldaña played in La maldición del padre Cardona.";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.tolower(&str[0], &str[0] + str.size());

    std::wcout << "Lower case: " << str << endl;

    return;
}

void non_ascii_to_upper(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"¥£ªÄë";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << "Upper case: " << str << endl;

    return;
}

int main ()
{
    string str="Test String.\n";

    ascii_to_lower(str);
    non_ascii_to_upper();
    non_ascii_to_lower();

    return 0;
}

The output is:

Ascii string: Test String. Lower case: test string.

Non-Ascii string: ▒▒▒▒▒ Upper case: ▒▒▒▒▒

Non-Ascii string: Zo▒ Salda▒a played in La maldici▒n del padre Cardona. Lower case: zo▒ salda▒a played in la maldici▒n del padre cardona.

Although the non-ASCII strings seem to be converted to upper and lower case, some of the characters are not displayed in the output. Why is this?

On the whole, does the sample code look ok?

Upvotes: 2

Views: 1362

Answers (2)

Christophe

Reputation: 73376

You already have a very good answer about Boost. Here are some additional remarks:

Character encoding

ASCII characters are encoded on 7 bits. ISO 8859-1 and windows-1252 extend ASCII with a limited set of international characters by making use of the 8th bit.

The Unicode standard extends ASCII much further: its code points range up to U+10FFFF, so 21 bits are needed and a 32-bit unit can hold any of them. Several encodings are available: UTF-32 is the easiest (1 Unicode code point = 1 32-bit code unit), while UTF-16 and UTF-8 store Unicode text in a variable-length encoding built from smaller code units.
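To make the size differences concrete, here is a minimal sketch that stores the same character in each encoding and counts the code units. The struct and function names are made up for illustration, and the UTF-8 bytes are written as hex escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: code units used by one character in each encoding.
struct CodeUnits {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// U+00EB (e with diaeresis), a character from the Latin-1 range.
CodeUnits units_for_e_diaeresis() {
    const std::string    utf8  = "\xC3\xAB";  // UTF-8 bytes written out by hand
    const std::u16string utf16 = u"\u00EB";
    const std::u32string utf32 = U"\u00EB";
    return CodeUnits{utf8.size(), utf16.size(), utf32.size()};
}

// U+1F600, a code point outside the Basic Multilingual Plane.
CodeUnits units_for_emoji() {
    const std::string    utf8  = "\xF0\x9F\x98\x80";
    const std::u16string utf16 = u"\U0001F600";  // stored as a surrogate pair
    const std::u32string utf32 = U"\U0001F600";
    return CodeUnits{utf8.size(), utf16.size(), utf32.size()};
}
```

The first character needs 2, 1 and 1 code units in UTF-8, UTF-16 and UTF-32 respectively; the second needs 4, 2 and 1. Only in UTF-32 does one array element always correspond to one code point.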

To make it even more difficult, different operating systems use different conventions. On Linux, wchar_t is generally a 32-bit wide character used for Unicode, wstring is a string based on wchar_t, and char strings generally use UTF-8. On Windows, wchar_t is 16 bits because Windows' native encoding was originally UCS-2 (a subset of Unicode, since superseded by UTF-16), and char is generally interpreted in the active ANSI code page, such as windows-1252.
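If you want to check what your own platform does, a small sketch like the following can help. The function names are hypothetical; only sizeof, CHAR_BIT and WCHAR_MAX come from the standard library.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the platform this is compiled for.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t can hold any Unicode code point (up to U+10FFFF).
bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(WCHAR_MAX) >= 0x10FFFFul;
}
```

With a 16-bit wchar_t the second function returns false, which is a reminder that one wchar_t is not always one character there.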

Dealing with character size and encoding

So to come back on your problem, there are two aspects to consider:

  • the storage - If you want one size that fits all, you could use char32_t, which can hold ASCII as well as any Unicode character, together with basic_string<char32_t> (that is, u32string) for strings, which supports all the functions you are used to from normal strings. Or you could use normal strings and adhere to UTF-8 everywhere.

  • the encoding - how your app interprets the value contained in your char in order to perform operations such as converting to lower or upper case. This is defined in the applicable locale.

Fortunately, the C++ standard library can cope with all these aspects:

  • locale helps to manage uppercase and lowercase conversion and testing (e.g. isupper(), isalpha(), ...) using the appropriate encoding
  • codecvt allows converting between various encodings
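As a rough illustration of the locale route, here is a sketch of a case-insensitive equality test for wide strings built on the std::ctype<wchar_t> facet. The function name iequals_w is made up for this example. It compares element by element, so it cannot handle contextual rules or mappings that change the string length, and what it does for non-ASCII letters depends on the locale you pass in.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: compares two wide strings after lower-casing each
// element with the ctype facet of the given locale.
bool iequals_w(const std::wstring& a, const std::wstring& b,
               const std::locale& loc = std::locale()) {
    if (a.size() != b.size()) return false;
    const std::ctype<wchar_t>& facet = std::use_facet<std::ctype<wchar_t> >(loc);
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (facet.tolower(a[i]) != facet.tolower(b[i])) return false;
    }
    return true;
}
```

Calling iequals_w(L"Test String", L"tEST sTRING", std::locale::classic()) returns true, since the classic "C" locale covers the ASCII letters. For accented letters you would pass a locale such as std::locale("en_US.UTF-8"), assuming it is installed on the system.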

Additional libraries

The ICU library does provide case-insensitive operations, for example case folding and case-insensitive comparison through UnicodeString::foldCase() and UnicodeString::caseCompare(). It also provides broader support for text processing, for example iterating through text elements, using collation ordering and so on.

I'd suggest sticking with the standard library or Boost, due to the wide support these enjoy.

Upvotes: 1

Adrian

Reputation: 10911

I'm a little surprised by this question. A simple search for "boost case conversion" returned, as the first entry, "Usage - 1.41.0 - Boost", which has an entry on case conversion.

A search for "stl case conversion" returns "tolower - C++ Reference - Cplusplus.com", which also shows how to convert using the STL.

To do a case insensitive search, convert both to lower or upper case and compare.
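A minimal sketch of that idea for plain ASCII text follows. The name ifind_ascii is made up here, and instead of building lower-cased copies it lower-cases each pair of characters on the fly inside std::search. It is not suitable for UTF-8 or other multi-byte text.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: position of the first case-insensitive match of
// needle in haystack, or std::string::npos if there is none. ASCII only.
std::size_t ifind_ascii(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;  // same convention as std::string::find
    auto same = [](char a, char b) {
        // Cast to unsigned char first: passing a negative char to
        // std::tolower is undefined behaviour.
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    };
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(), same);
    return it == haystack.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}
```

For example, ifind_ascii("HeLlO WoRld!", "world") returns 6.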

Example code from boost.org:

string str1("HeLlO WoRld!");
to_upper(str1); // str1=="HELLO WORLD!"

Example from Cplusplus.com:

// tolower example (C++)
#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

int main ()
{
  std::locale loc;
  std::string str="Test String.\n";
  for (std::string::size_type i=0; i<str.length(); ++i)
    std::cout << std::tolower(str[i],loc);
  return 0;
}

For ASCII characters (characters with a value < 128), there should be no problem. If you are using MBCS, you may need to use locales for code pages. Unicode should have no problems AFAIK.

As to Matt Jordan's comment:

The real issue with this request is that many languages have contextual requirements for case conversion - e.g. capital sigma 0x3A3 in Greek should become either 0x03C3 or 0x03C2, depending on whether it is at the end of a word or not.

I would be pleasantly surprised if the Boost library supported this. You would have to test it and report a bug if it doesn't. There's no reference on their page to say whether they do any contextual case conversions. A workaround might be to test both ways: convert to lowercase and compare, then convert to uppercase and compare. If either comparison is true, there's a match, which should work for 99.99% of cases.
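Here is a sketch of that workaround using the Greek sigma case from the comment. The two toy_ mapping functions are my own stand-ins that only know ASCII and the basic Greek letters; in real code the conversions would come from whatever library you use. All function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Toy lower-case mapping (assumption): ASCII and basic Greek capitals only.
char32_t toy_lower(char32_t c) {
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)
        return static_cast<char32_t>(c + 0x20);  // capital sigma -> U+03C3
    return c;
}

// Toy upper-case mapping (assumption): both small sigmas map to U+03A3.
char32_t toy_upper(char32_t c) {
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - 0x20);
    if (c == 0x03C2) return 0x03A3;              // final sigma
    if (c >= 0x03B1 && c <= 0x03C9) return static_cast<char32_t>(c - 0x20);
    return c;
}

// Compares two strings element by element after applying the given mapping.
template <typename Map>
bool equal_mapped(const std::u32string& a, const std::u32string& b, Map map) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (map(a[i]) != map(b[i])) return false;
    }
    return true;
}

// The workaround: a match under either conversion counts as a match.
bool matches_either_case(const std::u32string& a, const std::u32string& b) {
    return equal_mapped(a, b, toy_lower) || equal_mapped(a, b, toy_upper);
}
```

With a word in capitals ending in U+03A3 and the same word in lower case ending in the final sigma U+03C2, the lower-case comparison fails (U+03C3 vs U+03C2) but the upper-case comparison succeeds, so matches_either_case reports a match.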

An interesting paper by Bjarne Stroustrup, found here, is a good read regarding Locales.

Upvotes: 2
