monica

Reputation: 1065

Factory/Caching strategy for sharing large immutable objects

My problem is much like the one in the earlier post Optimal HashSet Initialization (Scala | Java): I want to use HashSet for speedups (currently I am using Set), but the HashSet does not exhibit its constant-time advantage.

For the solution mentioned:

You can minimize the cost of equals by interning. This means that you acquire new objects of the class through a factory method, which checks whether the requested new object already exists, and if so, returns a reference to the existing object. If you assert that every object of this type is constructed in this way you know that there is only one instance of each distinct object and equals becomes equivalent to object identity, which is a cheap reference comparison (eq in Scala).

However, I am not quite sure how to efficiently check

whether the requested new object already exists

for large objects (e.g. instances of a case class whose parameters include a HashMap or other nested object structures, etc.).

Comparing each of those complicated fields does not give much of a performance advantage, does it? Or, if it does, are there other ways?

In addition, I am also not sure how to make

equals becomes equivalent to object identity, which is a cheap reference comparison (eq in Scala).

in code.

The interning technique mentioned above is, I think, basically an object cache. Therefore, I looked at the technique described in the post Caching strategy for small immutable objects in Java?. However, I still do not see an efficient way to do this for large objects.

For convenience, I have quoted the caching technique (in Java) from that post, with /// marking my own thoughts and questions:

private static final int N_POINTS = 10191; 
private static final Point[] POINTS = new Point[N_POINTS];

public static Point of(int x, int y, int z) {
    int h = hash(x,y,z); /// I can use the hash code of each complicated field to compute this value
    int index = (h & 0x7fffffff) % N_POINTS;
    Point p = POINTS[index];
    if (p != null && p.x == x && p.y == y && p.z == z) /// Not working for large objects?
       return p;
    return POINTS[index] = new Point(x,y,z);
} 

To summarize, what is the best practice for implementing an efficient caching strategy for large objects, so that I can take advantage of HashSet in Scala?

Thanks,

Upvotes: 3

Views: 1049

Answers (1)

Richard Sitze

Reputation: 8463

The goal of interning is to enable the equals method to be implemented using reference equality as: this eq that (or this == that in Java). It's clear that this implementation will have optimal run-time characteristics compared to the more traditional equals that compares some set of fields.
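
For example (a minimal sketch, assuming a hypothetical class Sym whose instances are only ever produced by an interning factory such as the one sketched further below):

final class Sym(val name: String) {
  // Valid only because instances are created exclusively through an interning
  // factory, so structurally equal Syms are always the same object.
  override def equals(thatObj: Any): Boolean = thatObj match {
    case that: Sym => this eq that   // cheap reference comparison
    case _         => false
  }
  // Identity hash is consistent with the eq-based equals above.
  override def hashCode(): Int = System.identityHashCode(this)
}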

This comparison is only effective if there is one and only one instance of each "unique object" as determined by some set of fields of the object.

Interning is effective only insofar as the up-front cost of the intern operation can be fully offset by the minimized cost of the (possibly many) calls to equals, driven by HashMap.

As you've noted, this interning may require a potentially costly caching mechanism: there is a run-time overhead (performing the check) and a memory overhead (the size of the cache).

  • The most straightforward way to cache is with a HashMap and a traditional equals. hashCode should be lazy, caching its result so it does not need to be recomputed. Threading issues may need to be considered. (A rough sketch follows this list.)

  • One way to implement such a cache is to use a trie, perhaps with a hash table at each node, where each "level" corresponds to one field from the "set of fields used to establish uniqueness" (first level: field 1, second level: field 2, and so on).
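
As a rough sketch of the first (HashMap-based) bullet, using a hypothetical class Big with two identity fields (none of the names below come from your code):

import scala.collection.mutable

// The intern cache is keyed by the identity fields, so their traditional
// equals is paid once, at intern time; afterwards every Big with the same
// fields is the same instance and eq is a valid equality test.
final class Big private (val name: String, val payload: Map[String, Int]) {
  // Lazily cached structural hash: computed at most once per instance,
  // cheap on every later HashSet/HashMap lookup.
  private lazy val cachedHash: Int = (name, payload).hashCode()
  override def hashCode(): Int = cachedHash
  // equals is deliberately not overridden: the default reference equality
  // is correct because instances are only created via Big.of below.
}

object Big {
  private val cache = mutable.HashMap.empty[(String, Map[String, Int]), Big]

  // Factory: returns the existing instance if one with these fields exists.
  // Synchronized as a crude answer to the threading issues mentioned above.
  def of(name: String, payload: Map[String, Int]): Big = cache.synchronized {
    cache.getOrElseUpdate((name, payload), new Big(name, payload))
  }
}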

There are other viable ways to implement such a cache. Forgive me for avoiding any further discussion of such, and allow me instead to present ways to avoid dealing with the issue.

Option 1: no caching

Claim: You're likely to have sufficiently effective results by using a fast hash (and caching it internally), a "traditional" implementation of equals, and by starting with a HashMap or HashSet of sufficient initial size.

Ideally there are few collisions in the hash table, and the number of calls to equals is minimal.
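
A minimal sketch of this option (the class Record and its fields are hypothetical, and the two-argument mutable.HashSet constructor assumes Scala 2.13):

import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Keep the case-class-generated structural equals; only cache the
// (possibly expensive) structural hash so it is computed at most once.
final case class Record(payload: Map[String, Int], tags: Set[String]) {
  private lazy val cachedHash: Int = MurmurHash3.productHash(this)
  override def hashCode(): Int = cachedHash
}

object Option1Demo {
  def main(args: Array[String]): Unit = {
    // Pre-size the set (initial capacity, load factor) to limit rehashing.
    val seen = new mutable.HashSet[Record](1 << 14, 0.75)
    seen += Record(Map("x" -> 1), Set("t"))
    println(seen.contains(Record(Map("x" -> 1), Set("t")))) // true
  }
}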

Option 2: map multiple fields into one unique String

[This method assumes that the "fields that uniquely define the object" are immutable. Suitable adjustments can be made to compensate if this is not true.]

Build & cache a private unique: String that corresponds to the unique object instance. For example, the following may be unique for some simple objects:

The concatenation of the string values of the "set of fields used to establish uniqueness," separated by commas.

An understanding of your object/field characteristics will help determine how such a unique string can be created.

With such a value, we can avoid the separate intern/caching mechanism and retain much of the benefits by implementing both equals and hashCode in terms of this unique string:

override def equals(thatObj: Any): Boolean = thatObj match {
  case that: MyType => unique.equals(that.unique)
  case _            => false
}

override def hashCode(): Int = unique.hashCode
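
One possible way to build such a unique string and wire it in (the field names below are hypothetical, and the separator must be a character that cannot occur inside the field values):

final class MyType(val name: String, val version: Int, val attrs: Map[String, String]) {
  // Built lazily, once per instance; entries are sorted so that logically
  // equal maps always produce the same string.
  private lazy val unique: String = {
    val attrPart = attrs.toSeq.sorted.map(kv => kv._1 + "=" + kv._2).mkString(",")
    name + "," + version + "," + attrPart
  }

  override def equals(thatObj: Any): Boolean = thatObj match {
    case that: MyType => unique.equals(that.unique)
    case _            => false
  }

  override def hashCode(): Int = unique.hashCode
}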

Alternative to Option 2:

[EDIT: Rüdiger Klaehn provided this link which offers compelling evidence to avoid String.intern()]

Use String.intern and adjust the equals to take advantage of it:

private val unique = buildUnique().intern

override def equals(thatObj: Any): Boolean = thatObj match {
  case that: MyType => unique.eq(that.unique) // In Java: unique == that.unique
  case _            => false
}

Upvotes: 4
