Reputation: 17548
Code speaks more than words, so:
final int size = 100;
Map<Integer, String> m = new HashMap<>(size);
for (int i = 0; i < size; i++) m.put(i, String.valueOf(i));
Why is the HashMap internally calling resize() 2 times? (Originally I measured 21 calls; credit to Andreas for identifying that the JVM uses HashMaps internally, so 19 of the 21 calls were from other processes.)
Two resize() calls are still not acceptable for my application. I need to optimize this.
As a new Java developer, my first intuitive guess about what "capacity" means in the HashMap constructor is that it is the capacity for the number of elements that I (the consumer of HashMap) am going to put into the Map. But this is not true.
If I want to optimize my usage of HashMap so that it does not need to resize itself at all, then I need to know the internals of HashMap intimately enough to know exactly how sparse the HashMap bucket array needs to be. This is strange in my opinion. HashMap should implicitly do this for you; that is the whole point of encapsulation in OOP.
Note: I have confirmed that resize() is the bottleneck for my application's use case, so that is why my goal is to reduce the number of calls to resize().
The question:
If I know the exact number of entries I am going to put into the map beforehand, what capacity do I choose to prevent any extra resize() operations? Something like size * 10? I would also like some background on why HashMap is designed this way.
Edit: I am getting asked a lot why this optimization is necessary. My application is spending a non-trivial amount of CPU time in HashMap.resize(). The HashMaps my application uses are initialized with a capacity equal to the number of elements that we put into them. Therefore, if we can reduce the resize() calls (by choosing a better initial capacity), then my application's performance improves.
Upvotes: 10
Views: 8046
Reputation: 1476
Java has many distributions and versions, so you have to test it yourself.
HashMap gives unpredictable resizing results when its implementation has 2 conditions for resizing to take place: (size >= threshold) && (null != table[bucketIndex]). First, size must be >= threshold (capacity * load factor). Second, the bucket the new entry maps to must already have an entry, which implies a collision. Under that behaviour:
With a capacity of 4, resizing takes place when the 5th entry is put.
With a capacity of 8, resizing takes place when the 9th entry is put.
With a capacity of 16, resizing takes place when the 16th entry is put (not the 17th).
With a capacity of 32, resizing takes place when the 32nd entry is put (not the 33rd).
With a capacity of 64, resizing takes place when the 64th entry is put (not the 65th).
In versions where HashMap instead resizes as soon as the size is > threshold (capacity * load factor):
With the default capacity of 16 and the default load factor of 0.75, resizing (to a capacity of 32) takes place when the 13th entry is put.
With a capacity of 4, resizing takes place when the 4th entry is put.
With a capacity of 8, resizing takes place when the 7th entry is put.
With a capacity of 16, resizing takes place when the 13th entry is put.
With a capacity of 32, resizing takes place when the 25th entry is put.
With a capacity of 64, resizing takes place when the 49th entry is put.
The test program used:
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class HashMapTest {
    public static void main(String[] args) {
        int cap = 4;
        int size = 64;
        Map<Integer, String> map = new HashMap<>(cap);
        for (int i = 1; i <= size; i++) {
            map.put(i, i + "");
            print(map);
        }
    }

    // Calls HashMap's package-private capacity() method via reflection so we can
    // watch the bucket array grow as entries are added.
    public static void print(Map<?, ?> map) {
        try {
            Class<?> mapType = map.getClass();
            Method capacity = mapType.getDeclaredMethod("capacity");
            capacity.setAccessible(true);
            System.out.println("capacity : " + capacity.invoke(map) + " size : " + map.size());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Upvotes: 0
Reputation: 17548
There are a lot of wonderful answers here. I greatly appreciate the contributions.
I have decided not to re-invent this wheel, because it appears Google has already solved this problem.
I am going to use the utility method Maps.newHashMapWithExpectedSize(int) from Google's Guava library.
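For reference, a minimal sketch of that approach (the class name is mine; it assumes Guava is on the classpath):
import com.google.common.collect.Maps;
import java.util.Map;

public class GuavaSizing {                       // illustrative name
    public static void main(String[] args) {
        // Guava derives a safe initial capacity from the expected size and the
        // default load factor, so inserting 100 entries should not trigger a resize.
        Map<Integer, String> m = Maps.newHashMapWithExpectedSize(100);
        for (int i = 0; i < 100; i++) {
            m.put(i, String.valueOf(i));
        }
    }
}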
Upvotes: 1
Reputation: 14389
Resizing is an essential part of how a hashmap works, in order to keep the load factor low.
The load factor needs to be low because the hashmap's hashing function will inevitably start producing collisions as the hashmap's buckets fill up. Collisions can start from the second entry itself if your entries keep hashing to an occupied bucket.
However, in your particular case, collisions are not the problem; only the hashmap's resizing is.
A hashmap is generally resized at a load factor of 0.75 (= 3/4 as a fraction). Using this information, you can set up a hashmap with a capacity of 4/3 times the number of entries you need to store, as in the sketch below.
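A minimal sketch of that sizing rule (the class and helper names are mine, not an existing API):
import java.util.HashMap;
import java.util.Map;

public class MapSizing {                          // illustrative name
    // Illustrative helper: request a capacity of expectedEntries / loadFactor
    // (rounded up), which for the default load factor of 0.75 is 4/3 of the count.
    static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int capacity = (int) Math.ceil(expectedEntries / (double) loadFactor);
        return new HashMap<>(capacity, loadFactor);
    }

    public static void main(String[] args) {
        Map<Integer, String> m = presized(100, 0.75f);  // requests a capacity of 134
        for (int i = 0; i < 100; i++) {
            m.put(i, String.valueOf(i));                // should not trigger a resize
        }
    }
}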
Regarding your objection about breaking encapsulation:
I agree with you; it is debatable.
You could say that it would be better if capacity represented the count of entries up to which no resize happens, rather than the maximum count of entries that could be stored in the hashmap - and I tend to agree with you.
But someone else could also argue the other side, asking why the hashmap is occupying more space than it was instructed to reserve.
The solution to this problem lies in Java's domain: Java could provide two constructors that are explicit about what they do, and developers could then take their pick when initializing their hashmaps.
Upvotes: 0
Reputation: 120858
This is easy to prove:
private static <K, V> void debugResize(Map<K, V> map, K key, V value) throws Throwable {
    // Peek at HashMap's internal bucket array before the put.
    Field table = map.getClass().getDeclaredField("table");
    AccessibleObject.setAccessible(new Field[] { table }, true);
    Object[] nodes = ((Object[]) table.get(map));

    // On the first put the table is still null: resize() only initializes it,
    // so don't count that as a "real" resize.
    if (nodes == null) {
        map.put(key, value);
        return;
    }

    map.put(key, value);

    // Re-read the table and compare lengths; a changed length means resize() doubled it.
    Field field = map.getClass().getDeclaredField("table");
    AccessibleObject.setAccessible(new Field[] { field }, true);
    int x = ((Object[]) field.get(map)).length;
    if (nodes.length != x) {
        ++currentResizeCalls;
    }
}
And some usage:
static int currentResizeCalls = 0;

public static void main(String[] args) throws Throwable {
    int size = 100;
    Map<Integer, String> m = new HashMap<>(size);
    for (int i = 0; i < size; i++) {
        DeleteMe.debugResize(m, i, String.valueOf(i));
    }
    System.out.println(DeleteMe.currentResizeCalls);
}
I am only logging the cases where resize actually resizes the table, because the first call only initializes it; as the documentation stipulates:
Initializes or doubles table size
The second of your points is far more interesting. A HashMap defines capacity, but what is capacity? And this is not that obvious: for HashMap, capacity is the number of buckets before a resize happens; for ConcurrentHashMap, it's the number of entries before a resize is performed.
Thus, to avoid internal calls to resize, in the case of HashMap use the formula:
(int)(1.0 + (long)initialCapacity / LOAD_FACTOR)
But this is by far not ideal: say you want 1024 entries without a resize; by using that formula you get 1367 buckets, which are internally rounded up to a power of two, thus 2048 - a lot more than you asked for.
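A small sketch of how that formula might be applied (the class and method names are mine, purely illustrative):
import java.util.HashMap;
import java.util.Map;

public class CapacityFormula {
    static final float LOAD_FACTOR = 0.75f;   // HashMap's default load factor

    // Capacity to request so that `entries` insertions stay under the threshold,
    // following the formula from this answer.
    static int capacityFor(int entries) {
        return (int) (1.0 + (long) entries / LOAD_FACTOR);
    }

    public static void main(String[] args) {
        int requested = capacityFor(1024);
        System.out.println("requested capacity: " + requested);
        // HashMap rounds the requested capacity up to the next power of two for its
        // bucket array, so for ~1024 entries it ends up allocating 2048 buckets.
        Map<Integer, String> m = new HashMap<>(requested);
        for (int i = 0; i < 1024; i++) {
            m.put(i, String.valueOf(i));
        }
    }
}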
For ConcurrentHashMap, specify the size directly. This is easy to prove with a single modification to the previous code:
// use ConcurrentHashMap instead of HashMap
Map<Integer, String> m = new ConcurrentHashMap<>(size);
This will result in zero resizes that actually double the array. But sometimes even the ConcurrentHashMap internals are confusing and require a little patching.
Upvotes: 1
Reputation: 4699
When in doubt, read the documentation. The docs for HashMap explain the trade-offs of initial capacity and load factor quite well.
According to the documentation, if initCapacity = (maxEntries / loadFactor) + 1, then no rehash operations will occur when adding entries. In this case, maxEntries is 100 as you specified, and loadFactor would be the default load factor of .75.
But aside from just setting an initial size to avoid a rehash (resize()), you should carefully read the documentation of HashMap in order to tune it properly, taking both the initial capacity and the load factor into account.
If you care about lookup cost more than space, then perhaps try a lower loadFactor like .5 or lower. In that case you would create your hash map with both parameters, like this:
final float loadFactor = 0.5f;
final int maxEntries = 100;
final int initCapacity = (int) (maxEntries / loadFactor) + 1;
Map<Integer, String> map = new HashMap<>(initCapacity, loadFactor);
(emphasis mine)
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal datastructures are rebuilt) so that the hash table has approximately twice the number of buckets.
...
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
Upvotes: 8
Reputation: 159114
The default load factor is 0.75, i.e. 3/4, which means that the internal hash table will be resized when 75 of the 100 values have been added.
FYI: resize() is only called twice: once when the first value is added, and once when it gets to 75% full.
To prevent resizing, you need to ensure that the 100th value will not cause resizing, i.e. size <= capacity * 0.75, aka size <= capacity * 3/4, aka size * 4/3 <= capacity, so to be sure:
capacity = size * 4/3 + 1
With size = 100, that means capacity = 134.
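As a concrete sketch of that result (class and variable names are mine):
import java.util.HashMap;
import java.util.Map;

public class Presized {                          // illustrative name
    public static void main(String[] args) {
        int size = 100;
        int capacity = size * 4 / 3 + 1;         // 100 * 4/3 + 1 = 134
        Map<Integer, String> m = new HashMap<>(capacity);
        for (int i = 0; i < size; i++) {
            m.put(i, String.valueOf(i));         // no resize beyond the initial table allocation
        }
    }
}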
Upvotes: 16