s2 is s1 normalized with NormalizationForm.FormD.
Printed as strings, s1 and s2 look the same.
s1 and s2 return different values from GetHashCode.
String.Compare reports s1 and s2 as equal.
s2 printed as a string shows the accent.
s2.ToCharArray appears to remove the accent.
Why is s2.ToCharArray different from s2 as a string?
I figured it out.
The length of s2 is 4, not 3.
The accent is not removed; it is split out as a separate char, the combining acute accent (U+0301, decimal 769).
String.Compare is smart enough to figure it out.
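What is interesting is that String.Compare figures it out but String.Contains does not. String.Compare(string, string) performs a culture-sensitive comparison, while Contains, Equals, and == compare the raw chars ordinally.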
string s1 = "xxé";
string s1copy = "xxé";
string s2 = s1.Normalize(NormalizationForm.FormD);
string s2b = "xxe";
char accent = 'é';
Debug.WriteLine(s1); // xxé
Debug.WriteLine(s2); // xxé
Debug.WriteLine(s2b); // xxe
Debug.WriteLine(s1.GetHashCode()); // 424384421
Debug.WriteLine(s1copy.GetHashCode()); // 424384421
Debug.WriteLine(s2.GetHashCode()); // 1057341801
Debug.WriteLine(s2b.GetHashCode()); // 1701495145
Debug.WriteLine(s1.Contains(accent)); // true
Debug.WriteLine(s2.Contains(accent)); // false
Debug.WriteLine(s2b.Contains(accent)); // false
Debug.WriteLine(string.Compare(s1, s1copy).ToString()); // 0
Debug.WriteLine(string.Compare(s1, s2).ToString()); // 0
Debug.WriteLine(string.Compare(s1, s2b).ToString()); // 1
Debug.WriteLine(string.Compare(s2, s2b).ToString()); // 1
Debug.WriteLine(s1.Equals(s1copy)); // true
Debug.WriteLine(s1.Equals(s2)); // false
Debug.WriteLine(s1.Equals(s2b)); // false
Debug.WriteLine(s2.Equals(s2b)); // false
Debug.WriteLine(s1 == s1copy); // true
Debug.WriteLine(s1 == s2); // false
Debug.WriteLine(s1 == s2b); // false
Debug.WriteLine(s2 == s2b); // false
char[] chars1 = s1.ToCharArray();
char[] chars2 = s2.ToCharArray();
char[] chars2b = s2b.ToCharArray();
Debug.WriteLine(chars1.Length.ToString()); // 3
Debug.WriteLine(chars2.Length.ToString()); // 4
Debug.WriteLine(chars2b.Length.ToString()); // 3
Debug.WriteLine(chars1[0].ToString() + " " + ((Int16)chars1[0]).ToString() + " " + chars1[1].ToString() + " " + ((Int16)chars1[1]).ToString() + " " + chars1[2].ToString() + " " + ((Int16)chars1[2]).ToString());
// x 120 x 120 é 233
Debug.WriteLine(chars2[0].ToString() + " " + ((Int16)chars2[0]).ToString() + " " + chars2[1].ToString() + " " + ((Int16)chars2[1]).ToString() + " " + chars2[2].ToString() + " " + ((Int16)chars2[2]).ToString() +" " + chars2[3].ToString() + " " + ((Int16)chars2[3]).ToString());
// x 120 x 120 e 101 ́ 769
Debug.WriteLine(chars2b[0].ToString() + " " + ((Int16)chars2b[0]).ToString() + " " + chars2b[1].ToString() + " " + ((Int16)chars2b[1]).ToString() + " " + chars2b[2].ToString() + " " + ((Int16)chars2b[2]).ToString());
// x 120 x 120 e 101
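// Arrays use reference identity for GetHashCode and ==, so the hash values
// vary from run to run and the comparisons are false regardless of content.
Debug.WriteLine(chars1.GetHashCode()); // 16098066
Debug.WriteLine(chars2.GetHashCode()); // 53324351
Debug.WriteLine(chars2b.GetHashCode()); // 50785559
Debug.WriteLine(chars1 == chars2); // false
Debug.WriteLine(chars1 == chars2b); // false
Debug.WriteLine(chars2 == chars2b); // false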
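One way to make Equals and Contains agree with String.Compare is to bring both sides to the same normalization form before comparing ordinally. The following is only a minimal sketch of that idea, assuming default globalization settings; the class and the helper names CanonicallyEqual and ContainsNormalized are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Hypothetical helper: true when a and b are canonically equivalent,
    // checked by normalizing both to FormC and then comparing ordinally.
    public static bool CanonicallyEqual(string a, string b) =>
        string.Equals(a.Normalize(NormalizationForm.FormC),
                      b.Normalize(NormalizationForm.FormC),
                      StringComparison.Ordinal);

    // Hypothetical helper: ordinal Contains after normalizing both sides to FormC.
    public static bool ContainsNormalized(string text, string value) =>
        text.Normalize(NormalizationForm.FormC)
            .Contains(value.Normalize(NormalizationForm.FormC));

    static void Main()
    {
        string s1 = "xx\u00E9";                            // precomposed é
        string s2 = s1.Normalize(NormalizationForm.FormD); // e followed by U+0301

        Console.WriteLine(string.Equals(s1, s2, StringComparison.Ordinal)); // False
        Console.WriteLine(CanonicallyEqual(s1, s2));                        // True
        Console.WriteLine(s2.Contains("\u00E9"));                           // False
        Console.WriteLine(ContainsNormalized(s2, "\u00E9"));                // True
    }
}
```

Normalizing on every call allocates new strings, so for repeated lookups it is probably cheaper to normalize the input once when it is read and keep it in that form.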
Upvotes: 2
Why is s2.ToCharArray different from s2 as a string?
This occurs because of the NormalizationForm you have chosen. FormD will decompose xxé to x, x, e, and the combining acute accent (U+0301). The documentation describes FormD as follows:
Indicates that a Unicode string is normalized using full canonical decomposition.
If this is still unclear, here is a definition of Unicode composition:
In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.
Essentially, you're decomposing the string to its fully decomposed form, which yields the four separate chars you're seeing.
Maybe it will be clearer if you try recombining the char[]:
var s2Compare = new string(chars2);
var isEq = (s2Compare == s2); // true
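To go the other way and get back to three chars, the decomposed string can be recomposed with NormalizationForm.FormC. The sketch below also counts text elements with StringInfo, which treats a base letter plus its combining mark as one unit. The class and the helper names Recompose and TextElementCount are hypothetical and used only for this illustration; default globalization settings are assumed.

```csharp
using System;
using System.Globalization;
using System.Text;

static class RecomposeDemo
{
    // Hypothetical helper: canonical composition, the inverse of the FormD step.
    public static string Recompose(string s) => s.Normalize(NormalizationForm.FormC);

    // Hypothetical helper: counts text elements rather than UTF-16 chars.
    public static int TextElementCount(string s) => new StringInfo(s).LengthInTextElements;

    static void Main()
    {
        string s2 = "xxe\u0301";                   // x, x, e, combining acute accent

        Console.WriteLine(s2.Length);              // 4
        Console.WriteLine(TextElementCount(s2));   // 3

        string recomposed = Recompose(s2);
        Console.WriteLine(recomposed.Length);      // 3
        Console.WriteLine((int)recomposed[2]);     // 233
    }
}
```

This is why s2 prints as three visible characters even though ToCharArray returns four chars: the display layer renders the combining accent on top of the preceding e.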
Upvotes: 3