Normalized string is not the same as ToCharArray

Question

s2 is a normalized s1
as string s1 and s2 appear the same
s1 and s2 have a different GetHashCode
String.Compare shows s1 and s2 as equivalent

s2 as a string has the accent
s2.ToCharArray removes the accent

Why is s2.ToCharArray different from s2 as a string?

I figured it out.
The length of s2 is 4.
The accent is just stripped out as a separate char (Int16 = 769).
String.Compare is smart enough figure it out.

What is interesting is that String.Compare figures it out but String.Contains does not.

string s1 = "xxé";
string s1copy = "xxé";
string s2 = s1.Normalize(NormalizationForm.FormD);
string s2b = "xxe";
char accent = 'é';

Debug.WriteLine(s1);  // xxé
Debug.WriteLine(s2);  // xxé
Debug.WriteLine(s2b); // xxe

Debug.WriteLine(s1.GetHashCode());      // 424384421
Debug.WriteLine(s1copy.GetHashCode());  // 424384421
Debug.WriteLine(s2.GetHashCode());      // 1057341801
Debug.WriteLine(s2b.GetHashCode());     // 1701495145

Debug.WriteLine(s1.Contains(accent));   // true
Debug.WriteLine(s2.Contains(accent));   // false
Debug.WriteLine(s2b.Contains(accent));  // false

Debug.WriteLine(string.Compare(s1, s1copy).ToString());  // 0
Debug.WriteLine(string.Compare(s1, s2).ToString());      // 0
Debug.WriteLine(string.Compare(s1, s2b).ToString());     // 1
Debug.WriteLine(string.Compare(s2, s2b).ToString());     // 1

Debug.WriteLine(s1.Equals(s1copy));  // true
Debug.WriteLine(s1.Equals(s2));      // false
Debug.WriteLine(s1.Equals(s2b));     // false
Debug.WriteLine(s2.Equals(s2b));     // false

Debug.WriteLine(s1 == s1copy);  // true
Debug.WriteLine(s1 == s2);      // false
Debug.WriteLine(s1 == s2b);     // false
Debug.WriteLine(s2 == s2b);     // false

char[] chars1  = s1.ToCharArray();
char[] chars2  = s2.ToCharArray();
char[] chars2b = s2b.ToCharArray();
Debug.WriteLine(chars1.Length.ToString());  // 3
Debug.WriteLine(chars2.Length.ToString());  // 4
Debug.WriteLine(chars2b.Length.ToString()); // 3
Debug.WriteLine(chars1[0].ToString() + " "  + ((Int16)chars1[0]).ToString() + " "  + chars1[1].ToString() + " "  + ((Int16)chars1[1]).ToString() + " "  + chars1[2].ToString() + " "  + ((Int16)chars1[2]).ToString());
// x 120 x 120 é 233
Debug.WriteLine(chars2[0].ToString() + " " + ((Int16)chars2[0]).ToString() + " " + chars2[1].ToString() + " " + ((Int16)chars2[1]).ToString() + " " + chars2[2].ToString() + " " + ((Int16)chars2[2]).ToString() +" " + chars2[3].ToString() + " " + ((Int16)chars2[3]).ToString());  
//x 120 x 120 e 101 ́ 769
Debug.WriteLine(chars2b[0].ToString() + " " + ((Int16)chars2b[0]).ToString() + " " + chars2b[1].ToString() + " " + ((Int16)chars2b[1]).ToString() + " " + chars2b[2].ToString() + " " + ((Int16)chars2b[2]).ToString()); 
//x 120 x 120 e 101
Debug.WriteLine(chars1.GetHashCode());   // 16098066
Debug.WriteLine(chars2.GetHashCode());   // 53324351
Debug.WriteLine(chars2b.GetHashCode());  // 50785559
Debug.WriteLine(chars1 == chars2);  // false
Debug.WriteLine(chars1 == chars2b); // false
Debug.WriteLine(chars2 == chars2b); // false

Khan · Accepted Answer

Why is s2.ToCharArray different from s2 as a string?

This occurs because of the NormalizationForm you have chosen. It will decompose xxé to x, x, e, and `

NormalizationForm.FormD:

Indicates that a Unicode string is normalized using full canonical decomposition.

If this still is unclear, here is a definition of Unicode Composition

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.

Essentially, you're decomposing the string to its lowest form, which is the four different characters you're seeing.

Maybe it will be more clear if you try recombining the char[]

var s2Compare = new string(chars2)
var isEq = (s2Compare == s2) //true

Normalized string is not the same as ToCharArray

Answers (1)

Related Questions