Nitramk

Reputation: 1572

Case insensitive search in Unicode in C++ on Windows

I asked a similar question yesterday, but I realize I need to rephrase it.

In short: in C++ on Windows, how do I do a case-insensitive search for a string inside another string when both are Unicode wide-character strings (wchar_t) and I don't know the language of the text? I just want to know whether the needle exists in the haystack; its location isn't relevant to me.

Background: I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese or German characters). I don't know the language of the messages, since email messages don't contain that detail. Also, a single email body may contain characters from several languages.

I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% correct conversion from upper case to lower case without knowing the language of the text. I want a solution that works in the 99% of cases where it is possible.

I'm using Visual Studio 2008 with C++, STL and Boost.

Upvotes: 1

Views: 3458

Answers (4)

Mark Thornton

Reputation: 1885

You have to specify the language to do a case-insensitive comparison. For example, in Turkish, 'i' is NOT the lower case letter corresponding to 'I'; the lower case of 'I' is the dotless 'ı'. If the language appears not to be specified, then the comparison is being done with an implicitly selected language.

Upvotes: 4

Serge Wautier

Reputation: 21898

You could convert both the needle and the haystack to lowercase (or uppercase) and then call wcsstr().
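A minimal sketch of that idea, assuming std::towlower is an acceptable case mapping for the data. The function names to_lower_copy and contains_lowered are made up for illustration. Which characters towlower actually maps depends on the C runtime and the current C locale, and a per-wchar_t mapping cannot handle surrogate pairs or mappings that change the string length (for example German 'ß'), so this only covers the simple cases.

```cpp
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: returns a lowercased copy of s, mapping one wchar_t
// at a time with towlower (behaviour depends on the current C locale).
static std::wstring to_lower_copy(const std::wstring& s) {
    std::wstring out(s);
    for (std::wstring::size_type i = 0; i < out.size(); ++i) {
        out[i] = static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(out[i])));
    }
    return out;
}

// Hypothetical helper: true if needle occurs in haystack, ignoring case.
// wcsstr stops at the first embedded L'\0', so text after one is not searched.
bool contains_lowered(const std::wstring& haystack, const std::wstring& needle) {
    const std::wstring h = to_lower_copy(haystack);
    const std::wstring n = to_lower_copy(needle);
    return std::wcsstr(h.c_str(), n.c_str()) != 0;
}
```

Calling contains_lowered(body, L"invoice") would then match "Invoice" and "INVOICE" as well. The cost is two temporary copies per search, so for a large repository it may be worth lowercasing each body once and reusing it.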

Upvotes: 0

Michael Dillon

Reputation: 32392

You should use the ICU library, which provides Unicode regular expressions that follow the Unicode rules for case-insensitive matching. The library is available for C/C++ and for Java, and many other languages, such as Python, have wrappers for it.

Upvotes: 0

Ferruccio

Reputation: 100748

Boost String Algorithms has an icontains() function template which may do what you need.
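With Boost the call itself is boost::algorithm::icontains(haystack, needle), declared in <boost/algorithm/string/predicate.hpp>, with an optional std::locale as a third argument. To show the kind of element-wise comparison involved without depending on Boost, here is a rough standard-library sketch. The names EqualIgnoringCase and contains_ignoring_case are invented for this example, and it is not Boost's actual implementation. Like the Boost function, it compares one wchar_t at a time in a given locale, so it shares the same limits with language-specific rules.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical predicate: two wide characters are equal if their lowercase
// forms in the given locale are equal.
struct EqualIgnoringCase {
    explicit EqualIgnoringCase(const std::locale& loc) : loc_(loc) {}
    bool operator()(wchar_t a, wchar_t b) const {
        return std::tolower(a, loc_) == std::tolower(b, loc_);
    }
    std::locale loc_;
};

// Hypothetical helper: true if needle occurs in haystack, ignoring case.
// No temporary copies are made; std::search does the scanning.
bool contains_ignoring_case(const std::wstring& haystack,
                            const std::wstring& needle,
                            const std::locale& loc = std::locale()) {
    if (needle.empty()) {
        return true;  // an empty needle is treated as always present
    }
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       EqualIgnoringCase(loc)) != haystack.end();
}
```

The default std::locale() is a copy of the global C++ locale, which is the classic "C" locale unless the program changes it, so a different locale may need to be passed in for non-ASCII letters to be folded.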

Upvotes: 1
