Kishnan
Kishnan

Reputation: 657

Optimize memory usage of a collection of Strings in Java

I have a large number of name - value pairs (approx 100k) that I need to store in some sort of cache (say a hash map) where the value is a string with an average of about 30k bytes in size.

Now I know for a fact that a large number of the values have exactly the same string data. In order to avoid having to allocate the identical string data several times, I would like to somehow reuse a previously allocated string thus consuming less memory. In addition this needs to be reasonably fast. i.e. scanning through all the previously allocated values one-by-one is not an option.

Any recommendations on how I could solve this problem?

Upvotes: 9

Views: 5077

Answers (7)

Tom Hawtin - tackline
Tom Hawtin - tackline

Reputation: 147164

It somewhat depends upon how you are creating the String.

One possible way is to use TreeSet that uses a Comparator that can compare existing Strings and the source of your new String. Use SortedSet.tailSet and an Iterator to find an existing String. Or alternatively NavigableSet.ceiling/floor or a TreeMap with a similar setup.

String.intern has performance problems.

Upvotes: 1

kdgregory
kdgregory

Reputation: 39606

Agreed with others on not using String.intern(): once you've put a string there, it will never go away. Look to early revisions of Xerces for why this is a bad idea.

A better solution is to use a WeakHashMap, wrapping the value in a WeakReference:

private Map<String,WeakReference<String>> _map 
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;
    str = new String(str);
    _map.put(str, new WeakReference(str));
    return str;
}

This code is from an article that I wrote on the Java reference objects. You'll find the explanation there.

EDIT: need to create a new string here (and I'll update the article) because the original might be a substring of a far larger character array. I thought that was fixed around JDK 1.3, but apparently not (at least not in 1.5).

Upvotes: 1

Neil Coffey
Neil Coffey

Reputation: 21795

Do you actually need Strings, or do you just need any old CharSequence? If not, then consider implementing a "compact" CharSequence such as the one I suggest in the link.

Upvotes: 0

Peter Lawrey
Peter Lawrey

Reputation: 533520

You could compress the strings. A 30K string should get a good compression ratio. I wrote a hack to compress large String as an exercise, but you could use a byte[] of the compressed data to store the String.

A 30K character string will use about 60KB (2 bytes per character) so even using getBytes() is likely to be an improvement.

Upvotes: 0

Brian Agnew
Brian Agnew

Reputation: 272287

String.intern() will help you here (most likely). It will resolve multiple instances of the same string down to one copy.

EDIT: I suggested this would 'most likely' help. In what scenarios will it not ? Interning strings will have the effect of storing those interned string representations permanently. If the problem domain is a one-shot process, this may not be an issue. If it's a long running process (such as a web app) then you may well have a problem.

I would hesitate to say never use interning (I would hesistate to say never do anything). However there are scenarios where it's not ideal.

Upvotes: 9

james
james

Reputation: 1377

Do not use String.intern (there have been various memory issues related to this through the years). instead, create your own cache, similar to String.intern. basically, you want a Map, where each key maps to itself. then, before caching any string, you "intern" it:

private Map<String,WeakReference<String>> myInternMap = new WeakHashMap<String,,WeakReference<String>>();
public String intern(String value) {
  synchronized(myInternMap) {
    WeakReference<String> curRef = myInternMap.get(value);
    String curValue = ((curRef != null) ? curRef.get() : null);
    if(curValue != null) {
      return curValue;
    }

    myInternMap.put(value, new WeakReference<String>(value));
    return value;
  }
}

note, you use weakreferences for the keys and values so that you don't keep references for strings which you are no longer using.

Upvotes: 10

Miserable Variable
Miserable Variable

Reputation: 28752

String.intern is the obvious choice as Brian says. But if you don't want to intern across all the String in memory, you can use a Set to first see if the value is present. Here's untested code. You will have to work out removing from reverse map when removing from main

  class Map2<K, V> implements Map<K, V>
  {
    Map<K, V> _map = Maps.newHashMap();
    Set<V, V> _rev = Maps.newHashMap();

    V put(K k, V v) {
      if (_rev.containsKey(v)) {
        V prev = _rev.get(v);
        return _map.put(k, prev);
      } else {
        _rev.put(v, v);
        return _map.put(k,v);
      }
   }

Upvotes: 4

Related Questions