Kunjan
Kunjan

Reputation: 11

How does garbage collector decodes that a reference type object is of no use?

CLR garbage collector actively goes through all objects that have been created and works out if they are being used. But, how does garbage collector decide which object are to be killed and which are in use?

I understand the concept of assigning a null value to object will suffice. But, what if I write only

string obj = new string(new char[] {'a'}); 

and not null assignment lineobj = null;.

How will garbage collector determine when to clean it?

Upvotes: -2

Views: 1336

Answers (1)

Jörg W Mittag
Jörg W Mittag

Reputation: 369458

The CLR Garbage Collector is (at its core) a so-called tracing GC. (The other "big" class of garbage collectors are so-called reference-counting GCs.)

Tracing GCs work by, well recursively "tracing" the set of reachable objects from a set of objects that are already known to be reachable. Here's how that works:

Assume that we already have a set of objects that we know are reachable. For every object in that set, follow all the references (e.g. fields, and also internal pointers such as the class pointer etc.). Those objects are also reachable. Repeat until you have visited all objects at least once. Now you know all reachable objects. (We could say that we have computed the transitive closure with respect to reachability.) All objects that you haven't visited are not reachable and thus eligible for garbage collection.

Now, we just have to figure out how to start this algorithm, i.e. how do get the first set of objects that are known to be reachable. Well, every language usually has a set of objects that are known to be always reachable. This set from which we are starting our trace from, is called the root set. It includes things like:

  • globals
  • pointers in CPU registers
  • objects referenced by local variables on the stack
  • objects on the stack
  • Thread-Local Storage
  • unsafe memory
  • native memory
  • VM-internal data structures
  • the root namespace

That's it.

There are, of course, many variations of this theme. The most simple implementation of this tracing idea is called mark-sweep. It has two phases, mark and sweep (duh!) The mark phase is the tracing phase, you trace the reachable objects, and then you set a bit in the object header which says "yep, reachable". In the sweep phase, you collect all objects which don't have the bit set and reset the bit to false in the other objects.

A slight improvement of this scheme is to keep a separate marking table. For one, you don't have to write all over the entire RAM just to set those marking bits (which throws all data out of the cache, for example, and also triggers a copy-on-write if the memory is shared with another process). And secondly, you don't have to visit the reachable objects to reset the marking bit, you can just throw away the marking table after you're done.

The biggest dis-advantage of this scheme is that it leads to memory fragmentation. The biggest advantage is that objects don't move around in memory, which means for example that you can hand out pointers to objects without fear that those pointers may become invalid.

Another, very simple scheme, is Henry Baker's semi-space copying collector. It is called "semi-space" because it always uses at most 50% of the allocated memory. It is also a tracing collector, but it is a copying collector instead of mark-sweep. Instead of marking the objects when visiting them, it copies them over to the empty half of the memory. Afterwards, the old half can be simply freed in constant time.

The advantage is that everytime you copy the objects, they will be neatly tightly packed in memory without holes, so there is no fragmentation. But, they move around in memory, so you cannot just hand out pointers to those objects.

Note: the CLR's Garbage Collectors (it actually has two of them!) are much more complex and sophisticated than those two schemes I presented. They are, however, both tracing GCs.

The second big class of collectors are reference-counting collectors. Instead of tracing references only when a collection occurs, they count references, everytime a reference is created or destroyed. So, when you assign an object to a local variable, or a field, or pass it as an argument, …, the system increments a reference counter in the object header, and everytime you assign a different object to a local variable, or the local variable goes out of scope, or the object that the field belongs to gets GCd, …, the reference is decremented. If the reference count hits 0, there are no more references, and the object is eligible for garbage collection.

The big advantage of this scheme is that you always know exactly when an object becomes unreachable. The big disadvantage is that you can get disconnected cycles whose reference count(s) will never be 0. If you have a reference from AB, from BC, from CA, and from DB, then A's reference count is 1, B's reference count is 2, C's reference count is 1. If you now remove the reference from D, B's reference count drops to 1, and there is no reference from the rest of the system to either A or B or C, so they are all not reachable, but their reference count will never drop to 0, so they will never be collected.

A third big idea in GC is the Generational Hypothesis:

  • Objects die young
  • Older objects don't reference younger objects

As it turns out, for typical systems, this is true for almost all objects. Which means, it makes sense to treat objects differently depending on their age. A generational GC divides the objects into different generations, and has different garbage collection and memory allocation strategies for each. (Let's leave it at that.)

For more information about garbage collection in general, you should read The Garbage Collection Handbook – The art of automatic memory management by Richard Jones, Antony Hosking, Eliot Moss.

Upvotes: 3

Related Questions