Maddy

Reputation: 1379

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case-insensitive comparison without worrying about how the string is encoded or which encoding is involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without having to make sure the right locale is set, and so forth.

Use case: suppose my application receives a string (let's say it's a file name), initially "Zoë Saldaña.txt", and stores it as is. Subsequently, it receives another string, "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, using a few API calls. Same with the file names "abc.txt" and "AbC.txt".

I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:

  1. Does ICU provide a means of solving my requirement by seamlessly handling strings regardless of their encoding type?

  2. If the answer to 1. is no, is it safe to use ICU's APIs to convert all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?

  3. Are there alternatives that facilitate this?

I read this post, but it doesn't quite meet my requirements.

Thanks!

Upvotes: 3

Views: 2992

Answers (4)

Charlie Reitzel

Reputation: 925

For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale-neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications such as network protocols (e.g. CIFS) and international database data.

The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.

As of 2007, when I last looked, there were fewer than 2,000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (most likely vice versa as well, but I didn't try it).

At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.

There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
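The fold-table idea above can be sketched with an ordinary hash map (a perfect hash only makes the lookup faster and allocation-free). The table fragment and function names below are hypothetical; a real implementation would generate the full table from the Unicode `CaseFolding.txt` data:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical fragment of the Unicode simple case-folding table; the real
// table has on the order of 1,500 one-to-one pairs.
const std::unordered_map<char32_t, char32_t>& foldTable() {
    static const std::unordered_map<char32_t, char32_t> table = {
        {U'A', U'a'}, {U'Z', U'z'},               // ASCII letters
        {U'Ë', U'ë'}, {U'Ñ', U'ñ'}, {U'Û', U'û'}, // Latin-1 supplement
        {U'Σ', U'σ'},                             // Greek capital sigma
    };
    return table;
}

// Fold a single code point to its lower-case form (identity if not in table).
char32_t fold(char32_t c) {
    auto it = foldTable().find(c);
    return it == foldTable().end() ? c : it->second;
}

// Locale-neutral case-insensitive equality over code-point sequences.
bool foldEqual(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i]) != fold(b[i])) return false;
    return true;
}
```

Note this sketch handles only simple (one-to-one) folding on already-decoded code points; full folding (ß → ss) and normalization are separate steps.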

Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.

If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale-aware application, in which case the OP's question doesn't apply.

Upvotes: 4

Jaccoud

Reputation: 97

Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII cannot encode even a simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol or even emojis (conceptually, these are similar to ideograms). The majority of the world's population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still does it wrong.

(For example, NTFS has a long-reported bug that allows a directory to contain 2 files with names that look exactly the same but are encoded in different normal forms. I get this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated, because unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this in 2001 for Windows 7, and the bug is still present today in Windows 10.)

If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence

Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard specifies the algorithm). Please do not reinvent the wheel: use ICU's normalisation and comparison facilities. They have been extensively tested, they work correctly, and IBM has made them available gratis.

A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent, and highly influenced by the language and the context. For example, in a dictionary these Portuguese words would appear in this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical; or, to put it better, its rules are not simple.

C'est la vie ("such is life"). これが私たちの世界です ("this is our world").

Upvotes: -1

Serge Ballesta

Reputation: 148910

Without knowing the encoding, you cannot do that. I will take one example using French accented characters and 2 different encodings: cp850, used as the OEM character set by Windows in the Western European zone, and the well-known iso-8859-1 (also known as latin1, and not very different from the win1252 ANSI character set on Windows).

  • in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
  • in latin1, 0x96 is non printable(*), 0xca is 'Ê', 0xea is 'ê'

So if the string is cp850-encoded, 0xea should compare equal (case-insensitively) to 0x96, and 0xca is an unrelated character.

But if the string is latin1-encoded, 0xea should compare equal to 0xca, while 0x96 is a control character.

You could find similar examples with other iso-8859-x encodings, but I only speak of languages I know.

(*) In cp1252, 0x96 is '–', Unicode character U+2013, not related to 'ê'.
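The two bullet points can be expressed as data. These are hypothetical excerpts of the decoding tables, covering only the three bytes discussed, to show that the same byte decodes to unrelated characters under each encoding:

```cpp
#include <cstdint>
#include <map>

// Excerpt of the cp850 byte-to-code-point decoding (full table covers all
// 256 byte values).
const std::map<std::uint8_t, char32_t> kCp850 = {
    {0x96, U'û'}, {0xCA, U'╩'}, {0xEA, U'Û'},
};

// Excerpt of the latin1 (iso-8859-1) decoding; 0x96 is an unprintable
// control code, so it has no entry here.
const std::map<std::uint8_t, char32_t> kLatin1 = {
    {0xCA, U'Ê'}, {0xEA, U'ê'},
};
```

Any case-insensitive comparison must run on the decoded characters, not on the raw bytes, which is why the encoding has to be known first.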

Upvotes: 2

MSalters

Reputation: 179819

The requirement is impossible. Computers don't work with characters; they work with numbers. But "case insensitive" comparisons are operations that work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.

The above isn't just true for all programming languages; it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode it's even worse: there are number sequences of different lengths that are still equivalent (precomposed and decomposed characters in particular).

Upvotes: 6
