Reputation: 14379
Let's say I have a list of 1,000,000 users whose unique identifier is their username string. To compare two User objects I just override the compareTo() method and compare the username members.
Given a username string, I wish to find the User object from the list. What, in the average case, would be the fastest way to do this?
I'm guessing a HashMap mapping usernames to User objects, but I wondered if there was something else that I didn't know about which would be better.
Upvotes: 0
Views: 154
Reputation: 3011
If you don't change your list of users very often then you may want to use Aho-Corasick. You will need a pre-processing step that will take O(T) time and space, where T is the sum of the lengths of all user names. After that you can match user names in O(n) time, where n is the length of the user name you are looking for. Since you will have to look at every character in the user name you are looking for I don't think it's possible to do better than this.
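For an exact lookup of one whole user name, the trie that Aho-Corasick is built on is enough by itself; the failure links only matter when scanning a longer text for many patterns. A minimal sketch of that plain trie, assuming a hypothetical class name UsernameTrie and a generic value type standing in for User:

```java
import java.util.HashMap;
import java.util.Map;

// Plain trie keyed by the characters of the user name (not full Aho-Corasick).
// UsernameTrie is a hypothetical name; V would be the User type.
final class UsernameTrie<V> {
    private static final class Node<V> {
        final Map<Character, Node<V>> children = new HashMap<>();
        V value; // non-null only where a complete user name ends
    }

    private final Node<V> root = new Node<>();

    // Pre-processing step: O(length of the user name) per insert,
    // so O(T) in total for all names.
    void put(String username, V value) {
        Node<V> node = root;
        for (int i = 0; i < username.length(); i++) {
            node = node.children.computeIfAbsent(username.charAt(i), c -> new Node<>());
        }
        node.value = value;
    }

    // Lookup: O(n) for a user name of length n; null when there is no exact match.
    V get(String username) {
        Node<V> node = root;
        for (int i = 0; i < username.length(); i++) {
            node = node.children.get(username.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node.value;
    }
}
```

Note that a HashMap lookup also has to read every character of the key to hash it, so the asymptotic cost per lookup is comparable; which one is faster in practice would need measuring.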
Upvotes: 0
Reputation: 523
In terms of data structures, the HashMap can be a good choice. It favours larger datasets. The time for inserts is considered constant, O(1).
In this case it sounds like you will be carrying out more lookups than inserts. For lookups the average time complexity is O(1 + n/k), where n is the number of entries and k is the number of buckets; the key factor here (sorry about the pun) is how effective the hashing algorithm is at distributing the data evenly across the buckets.
The risk here is that the usernames are short and use a small character set such as a-z, in which case there could be a lot of collisions, causing the HashMap to be loaded unevenly and therefore slowing down the lookups. One option to improve this could be to create your own user key object and override the hashCode() method with an algorithm that suits your keys better.
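A rough sketch of such a key object follows. The class name UserKey and the FNV-1a-style mixing are illustrative assumptions, not a recommendation; String's built-in hashCode() is usually adequate, so it is worth measuring collisions before replacing it.

```java
// Hypothetical key wrapper with its own hash function.
final class UserKey {
    private final String username;
    private final int hash; // cached, since the key is immutable

    UserKey(String username) {
        this.username = username;
        this.hash = mix(username);
    }

    // FNV-1a-style mixing applied per char (an assumption chosen for illustration).
    private static int mix(String s) {
        int h = 0x811c9dc5;          // 32-bit FNV offset basis
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;         // 32-bit FNV prime
        }
        return h;
    }

    // equals() must agree with hashCode(): equal user names give equal keys.
    @Override
    public boolean equals(Object o) {
        return o instanceof UserKey && ((UserKey) o).username.equals(username);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

The map would then be declared as Map<UserKey, User>, and a lookup wraps the incoming string: map.get(new UserKey(name)).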
In summary, if you have a large data set, a good/suitable hashing algorithm, and the space to hold it all in memory, then HashMap can provide a relatively fast lookup.
I think, given your last post on the ArrayList and its scalability, I would take Bozho's suggestion and go for a purpose-built cache such as EhCache. This will allow you to control memory usage and eviction policies. It is still a lot faster than DB access.
Upvotes: 0
Reputation: 3158
I'm sure you want a hash map. They're the fastest thing going, and memory efficient. As also noted in other replies, a String works as a great key, so you don't need to override anything. (This is also true of the following.)
The chief alternative is a TreeMap. This is slower and uses a bit more memory. It's a lot more flexible, however. The same map will work great with 5 entries and 5 million entries. You don't need to clue it in in advance. If your list varies wildly in size, the TreeMap will grab memory as it needs it and let it go when it doesn't. Hash maps are not so good about letting go, and as I explain below, they can be awkward when grabbing more memory.
TreeMaps work better with garbage collectors. They ask for memory in small, easily found chunks. If you start a hash table with room for 100,000 entries, when it gets full it will free the 100,000-element array (almost a megabyte on a 64-bit machine) and ask for one that's even larger. If it does this repeatedly, it can get ahead of the GC, which tends to throw an out-of-memory error rather than spend a lot of time gathering up and concentrating scattered bits of free memory. (It prefers to maintain its reputation for speed at the expense of your machine's reputation for having a lot of memory. You really can manage to run out of memory with 90% of your heap unused because it's fragmented.)
So if you are running your program full tilt, your list of names varies wildly in size--and perhaps you even have several lists of names varying wildly in size--a TreeMap will work a lot better for you.
A hash map will no doubt be just what you need. But when things get really crazy, there's the ConcurrentSkipListMap. This is everything a TreeMap is except it's a bit slower. On the other hand, it allows adds, updates, deletes, and reads from multiple threads willy-nilly, with no synchronization. (I mention it just to be complete.)
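Since all three implement java.util.Map, switching between them is a one-line change. A small sketch, assuming Integer placeholders instead of User objects and a hypothetical helper class MapChoices:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

final class MapChoices {
    // Fills any Map implementation with a few sample user names.
    static Map<String, Integer> fill(Map<String, Integer> map) {
        map.put("carol", 3);
        map.put("alice", 1);
        map.put("bob", 2);
        return map;
    }

    public static void main(String[] args) {
        // Same lookup code for every implementation.
        Map<String, Integer> hash = fill(new HashMap<>());
        System.out.println(hash.get("bob"));

        // TreeMap and ConcurrentSkipListMap additionally keep keys sorted,
        // which HashMap does not.
        TreeMap<String, Integer> tree = new TreeMap<>();
        fill(tree);
        System.out.println(tree.firstKey());

        ConcurrentSkipListMap<String, Integer> skip = new ConcurrentSkipListMap<>();
        fill(skip);
        System.out.println(skip.firstKey());
    }
}
```

String already implements Comparable, so neither sorted map needs a custom comparator here.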
Upvotes: 0
Reputation: 597114
If you don't need to store them in a database (storing them there is the usual scenario), a HashMap<String, User> would work fine - it has O(1) average-case complexity for lookup.
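A minimal sketch of that map, assuming a bare-bones User class with only a username field (the real class will have more) and a hypothetical wrapper named UserIndex:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical User type; the question only says it has a username member.
final class User {
    final String username;

    User(String username) {
        this.username = username;
    }
}

final class UserIndex {
    private final Map<String, User> byName;

    UserIndex(List<User> users) {
        // Pre-size for the default 0.75 load factor so that loading
        // the whole list does not trigger repeated resizing.
        byName = new HashMap<>((int) (users.size() / 0.75f) + 1);
        for (User u : users) {
            byName.put(u.username, u);
        }
    }

    // Returns the matching User, or null if the username is unknown.
    User find(String username) {
        return byName.get(username);
    }
}
```

Building the index is a single pass over the list; after that each find() is one hash lookup rather than a scan.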
As noted, the usual scenario is to have them in the database. But in order to get faster results, caching is utilized. You can use EhCache - it is similar to ConcurrentHashMap, but it has time-to-live for elements and the option to be distributed across multiple machines.
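To illustrate what time-to-live adds on top of a ConcurrentHashMap, here is a toy sketch. It is not EhCache's API: the class TtlCache, its methods, and the injectable clock are all hypothetical, and a real caching framework also handles eviction, size limits, and distribution.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Illustrative only; names and design are assumptions, not EhCache.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns the cached value, or null if it is absent or has expired.
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            map.remove(key, e); // remove only if it is still this stale entry
            return null;
        }
        return e.value;
    }
}
```

On a miss the caller would load the user from the database and put() it back, which is the read-through pattern that caching frameworks automate.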
You should not dump your whole database in memory, because it will be hard to synchronize. You will face issues with invalidating the entries in the map and keeping them up-to-date. Caching frameworks make all this easier. Also note that the database has its own optimizations, and it is not unlikely that your users will be kept in memory there for faster access.
Upvotes: 6