Reputation: 41
I am creating two constructs for a university second-semester computer science course to count words in a text. One implementation uses an array of Word objects, each storing the word as a String and its frequency as an int. The other uses a HashMap with the word as key and the frequency as value. Now there is a function "totalWords" that should return the sum of all frequencies.
In the HashMap variant:
return _map.values().stream().reduce(0, (a, b) -> a + b);
In the array variant:
return Arrays.stream(_words)
.map((word) -> word != null ? word.count() : 0)
.reduce(0, (a, b) -> a + b);
My problem is: in a JUnit test with a very short test text, the array variant needs approximately 0.001 s and the map variant needs 0.040 s, and I do not understand why the map needs so much more time. Does anybody have an explanation and maybe a better solution?
Upvotes: 3
Views: 378
Reputation: 7576
One of the reasons is that iterating a HashMap can be much slower than iterating an array, and the reason is locality. The performance bottleneck of a modern processor is dominated by memory access, which is why caches are used. An array stores its data in a contiguous chunk of memory, so when that chunk is pulled into the cache you are likely to use everything in it; you get cache hits. Caches like contiguous data. Each element of a HashMap, on the other hand, may live at a different place in memory, so when you traverse the HashMap you get a lot of cache misses and end up swapping data in and out of the cache all the time, which dramatically slows down your program.
Even though the actual HashMap implementation is optimized so that its data is clustered together in memory, (@Radiodef) a HashMap uses some sort of linked list, so each element carries extra pointers. A HashMap therefore consumes more memory than an array; more memory means more cache misses and more page faults, so a HashMap is in general slower than an array.
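Aside from locality, both stream pipelines in the question also box every int, because reduce(0, (a, b) -> a + b) works on a Stream&lt;Integer&gt;. A sketch of primitive-stream versions (the Word class here is a stand-in for the one in the question, which I have not seen):

```java
import java.util.Arrays;
import java.util.Map;

// Stand-in for the Word class from the question (fields assumed).
class Word {
    final String text;
    final int count;
    Word(String text, int count) { this.text = text; this.count = count; }
    int count() { return count; }
}

class WordCounts {
    // HashMap variant: unbox once into an IntStream, then sum primitives.
    static int totalWords(Map<String, Integer> map) {
        return map.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Array variant: mapToInt avoids creating a Stream<Integer> entirely.
    static int totalWords(Word[] words) {
        return Arrays.stream(words)
                     .mapToInt(w -> w != null ? w.count() : 0)
                     .sum();
    }
}
```

This will not change the locality story, but it removes the per-element Integer allocations from both variants.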
Upvotes: 4
Reputation: 37855
A HashMap is a data structure which has (basically) an array of linked lists:
0: [ a ] -> [ b ] -> [ c ]
1: [ ]
2: [ ]
3: [ d ] -> [ e ]
4: [ ]
5: [ f ]
6: [ ]
7: [ ]
The linked lists hold entries whose keys landed in the same bucket (called a "collision").
So the data structure has "holes" in it, and also it's more fragmented than an array because a HashMap has an object for every entry. Iterating a HashMap will generate more loads from memory than iterating an array.
I also agree with JB Nizet that your benchmark is probably flawed. A good benchmark would probably still show the array performing better, but not by as stark a margin.
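To illustrate why a single cold JUnit timing misleads: the first run includes class loading and interpreted execution, before the JIT has compiled the stream pipelines. A minimal sketch (sizes and structure are illustrative, not from the original post) that warms up both pipelines before timing them:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.IntStream;

class WarmupBench {
    // Runs a task repeatedly and returns the total elapsed time in ns.
    static long timeSum(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000;
        int[] array = IntStream.range(0, n).toArray();
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) map.put(i, i);

        Runnable arraySum = () -> IntStream.of(array).sum();
        Runnable mapSum =
            () -> map.values().stream().mapToInt(Integer::intValue).sum();

        // Warm-up: give the JIT a chance to compile both pipelines.
        timeSum(arraySum, 10_000);
        timeSum(mapSum, 10_000);

        // Measured runs; the gap should be far smaller than on a cold call.
        System.out.println("array: " + timeSum(arraySum, 10_000) + " ns");
        System.out.println("map:   " + timeSum(mapSum, 10_000) + " ns");
    }
}
```

For anything beyond a sanity check, a proper harness such as JMH is the usual tool; hand-rolled timing like this is only a sketch.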
Upvotes: 0