m.edmondson

Reputation: 30922

Does string comparison == only work because strings are immutable?

I had a thought earlier while comparing two strings through their variables:

string str1 = "foofoo";
string strFoo = "foo";
string str2 = strFoo + strFoo;

// Even though str1 and str2 reference two different
// objects, the following assertion is true.

Debug.Assert(str1 == str2);

Is this purely because the .NET runtime recognises that the strings' values are the same and, because strings are immutable, makes the reference of str2 equal to that of str1?

So when we do str1 == str2, are we actually comparing references and not the values? I originally thought this was the product of syntactic sugar, but was I wrong?

Are there any inaccuracies in what I've written?

Upvotes: 13

Views: 1467

Answers (8)

Brian Rasmussen

Reputation: 116471

If we take a look at the jitted code, we'll see that str2 is assembled using String.Concat and that it in fact is not the same reference as str1. We will also see that the comparison is done using Equals. In other words the assert passes as the strings contain the same characters.

This code

static void Main(string[] args)
{
    string str1 = "foofoo";
    string strFoo = "foo";
    string str2 = strFoo + strFoo;
    Console.WriteLine(str1 == str2);
    Debugger.Break();
}

is jitted to (please scroll sideways to see comments)

C:\dev\sandbox\cs-console\Program.cs @ 22:
00340070 55              push    ebp
00340071 8bec            mov     ebp,esp
00340073 56              push    esi
00340074 8b3530206003    mov     esi,dword ptr ds:[3602030h] ("foofoo")  <-- Note address of "foofoo"

C:\dev\sandbox\cs-console\Program.cs @ 23:
0034007a 8b0d34206003    mov     ecx,dword ptr ds:[3602034h] ("foo")  <-- Note different address for "foo"

C:\dev\sandbox\cs-console\Program.cs @ 24:
00340080 8bd1            mov     edx,ecx
00340082 e81977fe6c      call    mscorlib_ni+0x2b77a0 (6d3277a0)     (System.String.Concat(System.String, System.String), mdToken: 0600035f)  <-- Call String.Concat to assemble str2
00340087 8bd0            mov     edx,eax
00340089 8bce            mov     ecx,esi
0034008b e870ebfd6c      call    mscorlib_ni+0x2aec00 (6d31ec00)     (System.String.Equals(System.String, System.String), mdToken: 060002d2)  <-- Compare using String.Equals
00340090 0fb6f0          movzx   esi,al
00340093 e83870f86c      call    mscorlib_ni+0x2570d0 (6d2c70d0) (System.Console.get_Out(), mdToken: 060008fd)
00340098 8bc8            mov     ecx,eax
0034009a 8bd6            mov     edx,esi
0034009c 8b01            mov     eax,dword ptr [ecx]
0034009e 8b4038          mov     eax,dword ptr [eax+38h]
003400a1 ff5010          call    dword ptr [eax+10h]

C:\dev\sandbox\cs-console\Program.cs @ 28:
003400a4 e87775596d      call    mscorlib_ni+0x867620 (6d8d7620) (System.Diagnostics.Debugger.Break(), mdToken: 0600239a)

C:\dev\sandbox\cs-console\Program.cs @ 29:
>>> 003400a9 5e              pop     esi
003400aa 5d              pop     ebp
003400ab c3              ret
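
The same observation can be checked without a debugger. This is only a minimal sketch, assuming a plain console project; the class name ReferenceVsValue and the helper BuildAtRunTime are made up for illustration and are not part of the question's code.

```csharp
using System;

public static class ReferenceVsValue
{
    // Builds the second string at run time, as in the question.
    // strFoo is an ordinary local (not const), so the + is compiled
    // to a String.Concat call rather than folded into a literal.
    public static string BuildAtRunTime()
    {
        string strFoo = "foo";
        return strFoo + strFoo;
    }

    public static void Main()
    {
        string str1 = "foofoo";
        string str2 = BuildAtRunTime();

        // Value comparison through the string == operator.
        Console.WriteLine(str1 == str2);

        // Reference comparison, bypassing the string operator.
        Console.WriteLine(object.ReferenceEquals(str1, str2));
    }
}
```

The first line should print True and the second False, which matches the two different addresses and the String.Equals call in the disassembly.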

Upvotes: 7

Jon Hanna

Reputation: 113352

In the order in which your code hits it...

== is overloaded for string. This means that rather than "abc" == "ab" + "c" calling the default == for reference types (which compares references and not values), it calls into string.Equals(a, b).

Now, this does the following:

  1. If the two are indeed the same reference, return true.
  2. If either is null, return false (we would have already returned true above if they were both null).
  3. If the two are different lengths, return false.
  4. Do an optimised cycle through one string, comparing it char-for-char with the other (actually int-for-int, viewing the two strings as blocks of ints in memory, which is one of the optimisations involved). If it reaches the end without a mismatch, return true, otherwise return false.

In other words, it starts with something like:

public static bool operator ==(string x, string y)
{
  //step 1:
  if(ReferenceEquals(x, y))
    return true;
  //step 2:
  if(ReferenceEquals(x, null) || ReferenceEquals(y, null))
    return false;
  //step 3:
  int len = x.Length;
  if(len != y.Length)
    return false;
  //step 4:
  for(int i = 0; i != len; ++i)
    if(x[i] != y[i])
      return false;
  return true;
}

Except that step 4 is a pointer-based version with an unrolled loop that should hence ideally be faster. I won't show that because I want to talk about the overall logic.

There are significant short-cuts. The first is in step 1. Since equality is reflexive (identity entails equality, a == a) then we can return true in nanoseconds for even a string several MB in size, if compared with itself.

Step 2 isn't a short-cut, because it's a condition that must be tested for, but note that because we'll already have returned true for (string)null == (string)null we don't need another branch. So the order of the checks is geared to a quick result.

Step 3 allows two things. It both short-cuts on strings of different length (always false) and means that one cannot accidentally shoot past the end of one of the strings being compared in step 4.

Note that this is not the case for other string comparisons, since e.g. WEISSBIER and weißbier are different lengths but the same word in different capitalisation, so case-insensitive comparison cannot use step 3. All equality comparisons can do steps 1 and 2, as the rules they rely on always hold, so you should use them in your own comparisons. Only some can do step 3.

Hence, while you are wrong in suggesting that it is references rather than values that are compared, it is true that references are compared first as a very significant short-cut. Note also that interned strings (strings placed in the intern pool by compilation or by a call to string.Intern) will hence trigger this short-cut often. This would be the case for "abc" == "ab" + "c", where the compiler folds the constant concatenation into a single literal and so uses the same reference on each side. It would not be the case for the code in your question, because strFoo is not a constant and strFoo + strFoo is therefore built at run time.

If you know that both strings were interned you can depend upon this (just do a reference equality test), but even if you don't know for sure you can benefit from it (the reference equality test will short-cut at least some of the time).
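
To make the interning point concrete, here is a small sketch. It assumes nothing beyond the standard string.Intern method; the class name InternDemo and the helper Fresh are invented for the example.

```csharp
using System;

public static class InternDemo
{
    // Creates a fresh string object on every call, so the result is
    // never the pooled literal even though the characters match.
    public static string Fresh() => new string(new[] { 'a', 'b', 'c' });

    public static void Main()
    {
        string a = Fresh();
        string b = Fresh();

        Console.WriteLine(a == b);                         // same characters
        Console.WriteLine(object.ReferenceEquals(a, b));   // two separate objects

        // string.Intern returns the pool's single instance for this value,
        // so both calls hand back the same reference.
        string ia = string.Intern(a);
        string ib = string.Intern(b);
        Console.WriteLine(object.ReferenceEquals(ia, ib)); // one shared object
    }
}
```

This should print True, False, True. Once both sides are known to come from the pool, the reference test alone is enough, and step 1 of the comparison always fires for equal values.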

If you have a bunch of strings where you will want to test some of them against each other often, but you don't want to extend their lifetime in memory as much as interning does, then you could use an XmlNameTable or LockFreeAtomizer (soon to be renamed ThreadSafeAtomizer and the doc moved to http://hackcraft.github.com/Ariadne/documentation/html/T_Ariadne_ThreadSafeAtomizer_1.htm - should have been named for function rather than implementation details in the first place).

The former is used internally by XmlTextReader and hence by a lot of the rest of System.Xml and can be used by other code too. The latter I wrote because I wanted a similar idea, that was safe for concurrent calls, for different types, and where I could override the equality comparison.

In either case, if you put 50 different strings that are all "abc" into it, you'll get a single "abc" reference back allowing the others to be garbage collected. If you know this has happened you can depend upon ReferenceEquals alone, and if you're not sure, you'll still benefit from the short-cut when it is the case.
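
A rough sketch of the XmlNameTable route, using NameTable (the concrete implementation that ships in System.Xml). The class name AtomizeDemo and the helper Fresh are illustrative only, and LockFreeAtomizer is not shown.

```csharp
using System;
using System.Xml;

public static class AtomizeDemo
{
    // Creates a new string object each time, with the same characters.
    public static string Fresh() => new string(new[] { 'a', 'b', 'c' });

    public static void Main()
    {
        var table = new NameTable();

        // Add returns the table's atomized instance for the given value.
        string first = table.Add(Fresh());
        string second = table.Add(Fresh());

        // Both results are the same object, so the duplicate passed to the
        // second Add call is free to be garbage collected.
        Console.WriteLine(object.ReferenceEquals(first, second));
    }
}
```

This should print True. Unlike the intern pool, the strings are only kept alive for as long as the table itself is reachable.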

Upvotes: 1

Diego

Reputation: 1569

According to MSDN (http://msdn.microsoft.com/en-us/library/53k8ybth.aspx):

For predefined value types, the equality operator (==) returns true if the values of its operands are equal, false otherwise. For reference types other than string, == returns true if its two operands refer to the same object. For the string type, == compares the values of the strings.

Upvotes: 0

STW

Reputation: 46394

The reference equality operator == can be overloaded, and in the case of System.String it is overloaded to use value-equality behavior. For true reference equality you can use the Object.ReferenceEquals() method, which is static and so cannot be overridden.

Upvotes: 2

Chris Shain

Reputation: 51369

No.

== works because the String class overloads the == operator to be equivalent to the Equals method.

From Reflector:

[TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
public static bool operator ==(string a, string b)
{
    return Equals(a, b);
}
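
The same pattern is available to any class, which shows there is nothing string-specific or immutability-specific about it. The Label type below is hypothetical and exists only to illustrate the idea; it is not taken from the framework.

```csharp
using System;

// Hypothetical type, used only to illustrate overloading == on a class.
public sealed class Label
{
    public string Text { get; }

    public Label(string text) { Text = text; }

    public static bool operator ==(Label a, Label b)
    {
        if (ReferenceEquals(a, b)) return true;    // same object, or both null
        if (a is null || b is null) return false;  // exactly one is null
        return a.Text == b.Text;                   // compare by value
    }

    // C# requires != to be overloaded together with ==.
    public static bool operator !=(Label a, Label b) => !(a == b);

    // Keep Equals and GetHashCode consistent with the operator.
    public override bool Equals(object obj) => obj is Label other && this == other;
    public override int GetHashCode() => Text?.GetHashCode() ?? 0;
}

public static class LabelDemo
{
    public static void Main()
    {
        var x = new Label("foo");
        var y = new Label("foo");

        Console.WriteLine(x == y);                        // overloaded operator
        Console.WriteLine(object.ReferenceEquals(x, y));  // reference comparison
    }
}
```

This should print True and then False: two distinct objects that compare equal through the overloaded operator.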

Upvotes: 10

TomTom

Reputation: 62157

Is this purely because the .NET runtime recognises that the strings' values are the same and, because strings are immutable, makes the reference of str2 equal to that of str1?

No. First, it is because str1 and str2 ARE identical - they are the same string because the compiler can optimize that out. strFoo + strFoo is a compile time constant identical to str1. As strings are INTERNED in classes they use the same string.

Second, string OVERLOADS the == operator. Check the source code in the reference sources, which have been available on the internet for some time.

Upvotes: 2

DaveShaw

Reputation: 52798

The answer is in the C# Spec §7.10.7

The string equality operators compare string values rather than string references. When two separate string instances contain the exact same sequence of characters, the values of the strings are equal, but the references are different. As described in §7.10.6, the reference type equality operators can be used to compare string references instead of string values.
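
A short sketch of what that paragraph means in practice. The operator is chosen from the compile-time types of the operands, so casting to object selects the reference comparison. The class name StaticTypeDemo and the helper Fresh are assumptions made for the example.

```csharp
using System;

public static class StaticTypeDemo
{
    // Creates a new object with the same characters as the literal "foofoo".
    public static string Fresh() => new string(new[] { 'f', 'o', 'o', 'f', 'o', 'o' });

    public static void Main()
    {
        string a = "foofoo";
        string b = Fresh();

        // Both operands are typed as string: the string operator compares values.
        Console.WriteLine(a == b);

        // Both operands are typed as object: the reference operator is used.
        Console.WriteLine((object)a == (object)b);

        // Equals is virtual, so it still reaches String.Equals through object.
        object oa = a, ob = b;
        Console.WriteLine(oa.Equals(ob));
    }
}
```

This should print True, False, True. The middle result is why == on string is described as overloaded rather than overridden: the choice is made at compile time, not by the run-time type.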

Upvotes: 14

Magnus

Reputation: 46977

Actually, String.Equals first checks if it is the same reference and if not compares the content.

Upvotes: 7
