Reputation: 657
I have a large number of name - value pairs (approx 100k) that I need to store in some sort of cache (say a hash map) where the value is a string with an average of about 30k bytes in size.
Now I know for a fact that a large number of the values have exactly the same string data. In order to avoid having to allocate the identical string data several times, I would like to somehow reuse a previously allocated string thus consuming less memory. In addition this needs to be reasonably fast. i.e. scanning through all the previously allocated values one-by-one is not an option.
Any recommendations on how I could solve this problem?
Upvotes: 9
Views: 5077
Reputation: 147164
It somewhat depends upon how you are creating the String
.
One possible way is to use TreeSet
that uses a Comparator
that can compare existing String
s and the source of your new String
. Use SortedSet.tailSet
and an Iterator
to find an existing String
. Or alternatively NavigableSet.ceiling/floor
or a TreeMap
with a similar setup.
String.intern
has performance problems.
Upvotes: 1
Reputation: 39606
Agreed with others on not using String.intern(): once you've put a string there, it will never go away. Look to early revisions of Xerces for why this is a bad idea.
A better solution is to use a WeakHashMap, wrapping the value in a WeakReference:
private Map<String,WeakReference<String>> _map
= new WeakHashMap<String,WeakReference<String>>();
public synchronized String intern(String str)
{
WeakReference<String> ref = _map.get(str);
String s2 = (ref != null) ? ref.get() : null;
if (s2 != null)
return s2;
str = new String(str);
_map.put(str, new WeakReference(str));
return str;
}
This code is from an article that I wrote on the Java reference objects. You'll find the explanation there.
EDIT: need to create a new string here (and I'll update the article) because the original might be a substring of a far larger character array. I thought that was fixed around JDK 1.3, but apparently not (at least not in 1.5).
Upvotes: 1
Reputation: 21795
Do you actually need Strings, or do you just need any old CharSequence? If not, then consider implementing a "compact" CharSequence such as the one I suggest in the link.
Upvotes: 0
Reputation: 533520
You could compress the strings. A 30K string should get a good compression ratio. I wrote a hack to compress large String as an exercise, but you could use a byte[] of the compressed data to store the String.
A 30K character string will use about 60KB (2 bytes per character) so even using getBytes() is likely to be an improvement.
Upvotes: 0
Reputation: 272287
String.intern() will help you here (most likely). It will resolve multiple instances of the same string down to one copy.
EDIT: I suggested this would 'most likely' help. In what scenarios will it not ? Interning strings will have the effect of storing those interned string representations permanently. If the problem domain is a one-shot process, this may not be an issue. If it's a long running process (such as a web app) then you may well have a problem.
I would hesitate to say never use interning (I would hesistate to say never do anything). However there are scenarios where it's not ideal.
Upvotes: 9
Reputation: 1377
Do not use String.intern (there have been various memory issues related to this through the years). instead, create your own cache, similar to String.intern. basically, you want a Map, where each key maps to itself. then, before caching any string, you "intern" it:
private Map<String,WeakReference<String>> myInternMap = new WeakHashMap<String,,WeakReference<String>>();
public String intern(String value) {
synchronized(myInternMap) {
WeakReference<String> curRef = myInternMap.get(value);
String curValue = ((curRef != null) ? curRef.get() : null);
if(curValue != null) {
return curValue;
}
myInternMap.put(value, new WeakReference<String>(value));
return value;
}
}
note, you use weakreferences for the keys and values so that you don't keep references for strings which you are no longer using.
Upvotes: 10
Reputation: 28752
String.intern
is the obvious choice as Brian says. But if you don't want to intern across all the String in memory, you can use a Set to first see if the value is present. Here's untested code. You will have to work out removing from reverse map when removing from main
class Map2<K, V> implements Map<K, V>
{
Map<K, V> _map = Maps.newHashMap();
Set<V, V> _rev = Maps.newHashMap();
V put(K k, V v) {
if (_rev.containsKey(v)) {
V prev = _rev.get(v);
return _map.put(k, prev);
} else {
_rev.put(v, v);
return _map.put(k,v);
}
}
Upvotes: 4