Dave R.

Reputation: 7304

How to fold/normalize ligature characters in C# for string comparisons

I am exploring some of the more esoteric aspects of string comparisons at the moment (I went down a bit of a rabbit hole and there doesn't seem to be an end in sight!).

I'd like to know how to compare a string containing characters with ligatures to a canonical non-ligature version (imagine an application for French language learning that lets the user type in oeuf or œuf interchangeably). I think this is called 'folding', but I could be wrong.

I've tried normalizing my strings using NFKD, which I thought would decompose each character into its constituent parts, but only some Unicode code points have a decomposition. (Of course, my example character 'œ' doesn't, which resulted in much hair-pulling.)

For example:

using System;
using System.Text;
using System.Globalization;

// This character does not support decomposition.
string str1 = "\u0153"; // œ (LATIN SMALL LIGATURE OE)
string str2 = "oe";
string str1norm = str1.Normalize(NormalizationForm.FormKD);

Console.WriteLine(str1norm.IsNormalized()); // True
Console.WriteLine(str1norm.Equals(str2));   // False
Console.WriteLine(str1norm.Length);         // 1

// This character supports decomposition.
str1 = "\uFB06"; // ﬆ (LATIN SMALL LIGATURE ST)
str2 = "st";
str1norm = str1.Normalize(NormalizationForm.FormKD);

Console.WriteLine(str1norm.IsNormalized()); // True
Console.WriteLine(str1norm.Equals(str2));   // True
// Non-normalized comparison: True in most locales, but not all (see below).
Console.WriteLine(str1.Equals(str2, StringComparison.CurrentCultureIgnoreCase));
Console.WriteLine(str1norm.Length);         // 2

References for the two Unicode characters: U+0153 (LATIN SMALL LIGATURE OE) and U+FB06 (LATIN SMALL LIGATURE ST).

If only some ligatures are decomposable, I have a couple of questions:

  1. How do I determine this without manually checking through all the Unicode code points?
  2. Can I do a string comparison to a canonical version of the string without having to create a dictionary of all the non-decomposable characters and their expanded forms? I mean, I like typing, but not that much.
  3. (Bonus round.) Why are some ligature code points decomposable (like the 'ﬆ' example), and some not? Is there something special about characters like 'œ'?
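For question 1, a brute-force probe is at least possible: normalize each code point on its own and see whether NFKD changes it. The sketch below is only an illustration of that idea, not a definitive answer. `DecomposesUnderNfkd` is a made-up helper name, and the scanned range (U+00C0 to U+017F) is an arbitrary choice to keep the output short.

```csharp
using System;
using System.Text;

// Hypothetical helper (not a framework API): true if NFKD rewrites the code point.
static bool DecomposesUnderNfkd(int codePoint)
{
    string original = char.ConvertFromUtf32(codePoint);
    string decomposed = original.Normalize(NormalizationForm.FormKD);
    return !string.Equals(original, decomposed, StringComparison.Ordinal);
}

Console.WriteLine(DecomposesUnderNfkd(0x0153)); // œ: False, matching the example above
Console.WriteLine(DecomposesUnderNfkd(0xFB06)); // ﬆ: True, matching the example above

// List the code points in U+00C0..U+017F that NFKD leaves untouched.
// The range is an assumption for illustration; widen it as needed.
for (int cp = 0x00C0; cp <= 0x017F; cp++)
{
    if (!DecomposesUnderNfkd(cp))
    {
        Console.WriteLine($"U+{cp:X4} {char.ConvertFromUtf32(cp)}");
    }
}
```

This only shows which characters NFKD leaves alone; it does not say what their expanded forms should be.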

Edit: I've done some further research, and characters like 'œ' or the German 'sharp s' are thought of as distinct characters, not composites. They represent the current usage in those languages, even if historically things may have developed from different representations. Unicode takes into account current usage, so marking them as non-decomposable makes sense.

As a final illustration of my "what-on-earth-is-going-on?" mindset at the moment, 'ﬆ' is equal to 'st' without normalizing in 862 locales on my PC, with the string comparison failing in only 7, with those failure locales being:

This is all very interesting, but the sheer apparent arbitrariness of it all is a bit overwhelming :) As far as I know, Afar and Saho have Latin-based written forms, but I assume their Unicode code pages are less popular and may just lack the entries for ligatures which are present in others.
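A per-locale count like the one above can be gathered with a loop along these lines. This is a sketch under assumptions: it calls each culture's `CompareInfo` directly with `CompareOptions.IgnoreCase` rather than switching the current culture, and `CulturesWhereDifferent` is a hypothetical helper name. The totals depend on the OS and runtime, so they will not necessarily match the numbers quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: names of the cultures whose case-insensitive
// comparison reports a and b as different.
static List<string> CulturesWhereDifferent(string a, string b)
{
    var names = new List<string>();
    foreach (CultureInfo culture in CultureInfo.GetCultures(CultureTypes.AllCultures))
    {
        if (culture.CompareInfo.Compare(a, b, CompareOptions.IgnoreCase) != 0)
        {
            names.Add(culture.Name);
        }
    }
    return names;
}

int total = CultureInfo.GetCultures(CultureTypes.AllCultures).Length;
List<string> failures = CulturesWhereDifferent("\uFB06", "st");

Console.WriteLine($"{total - failures.Count} of {total} cultures compare equal");
foreach (string name in failures)
{
    Console.WriteLine(name); // the invariant culture has an empty name
}
```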

Finally finally, the single-character 'ﬆ' and its decomposed form 'st' are equal in the .NET Invariant culture, but only if case is ignored:

//...continuing from prior code...
Console.WriteLine(str1.Equals(str2, StringComparison.InvariantCulture));           // False
Console.WriteLine(str1.Equals(str2, StringComparison.InvariantCultureIgnoreCase)); // True

This is puzzling because both strings are lowercase.
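One way to narrow this down is to try each `CompareOptions` flag separately against the invariant culture and see which ones change the outcome. The sketch below is a diagnostic only; `InvariantEquals` is a hypothetical helper, and the results can differ between runtimes and operating systems (the underlying collation library is not the same everywhere), so no expected output is shown.

```csharp
using System;
using System.Globalization;

CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;

// Hypothetical helper: true if the invariant culture treats a and b
// as equal under the given options.
bool InvariantEquals(string a, string b, CompareOptions options) =>
    invariant.Compare(a, b, options) == 0;

CompareOptions[] optionsToTry =
{
    CompareOptions.None,
    CompareOptions.IgnoreCase,
    CompareOptions.IgnoreNonSpace,
    CompareOptions.IgnoreWidth,
    CompareOptions.IgnoreKanaType,
    CompareOptions.IgnoreSymbols,
};

foreach (CompareOptions options in optionsToTry)
{
    bool stEqual = InvariantEquals("\uFB06", "st", options); // ﬆ vs st
    bool oeEqual = InvariantEquals("\u0153", "oe", options); // œ vs oe
    Console.WriteLine($"{options,-15} st: {stEqual}  oe: {oeEqual}");
}
```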

(I know, "Just normalize your string and forget you ever saw this, Dave." But I still think it's interesting to know why this happens.)

Thanks for your time.

Upvotes: 2

Views: 143

Answers (0)
