Jorge
Jorge

Reputation: 1634

HashSet vs ArrayList contains performance

When processing large amounts of data I often find myself doing the following:

HashSet<String> set = new HashSet<String> ();
//Adding elements to the set
ArrayList<String> list = new ArrayList<String> (set);

Something like "dumping" the contents of the set in the list. I usually do this since the elements I add often contain duplicates I want to remove, and this seems like an easy way to remove them.

With only that objective in mind (avoiding duplicates) I could also write:

ArrayList<String> list = new ArrayList<String> ();
// Processing here
if (! list.contains(element)) list.add(element);
//More processing here

And thus no need for "dumping" the set into the list. However, I'd be doing a small check before inserting each element (which I'm assuming HashSet does as well)

Is any of the two possibilities clearly more efficient?

Upvotes: 59

Views: 69782

Answers (7)

Jesse Barnum
Jesse Barnum

Reputation: 6856

I was curious about this and wrote a quick benchmark in Java 17... Set.contains() is 1.77 times faster than List.contains() when doing a billion lookups on a 3 item collection, and 1,244 times faster when doing a million lookups for a 10,000 item collection.

So Set.contains() is faster than List.contains(), even for small collections.

Upvotes: 0

cquezel
cquezel

Reputation: 4497

I did a small trivial test of the "contains" method using random strings on Java 17 using TreeSet, HashSet and ArrayList.

The break even point is around 5 elements in the collections. 4 or less elements, ArrayList is faster. 6 or more elements, HashMap is faster.

Intuitively, i would have thought that the 5 value would be much higher and that TreeSet would have outperformed HashSet for smaller sizes.

Upvotes: 3

Dici
Dici

Reputation: 25980

The set will give much better performance (O(n) vs O(n^2) for the list), and that's normal because set membership (the contains operation) is the very purpose of a set.

Contains for a HashSet is O(1) compared to O(n) for a list, therefore you should never use a list if you often need to run contains.

Upvotes: 116

urs86ro
urs86ro

Reputation: 107

I've made a test so please check the result:

For SAME STRING items in a HashSet, TreeSet, ArrayList and LinkedList, here are the results for

  1. 50.000 UUIDs
    • SEARCHED ITEM : e608c7d5-c861-4603-9134-8c636a05a42b (index 25.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 2 ms
    • linkedList.contains(item) ? TRUE 3 ms
  2. 5.000.000 UUIDs
    • SEARCHED ITEM : 61fb2592-3186-4256-a084-6c96f9322a86 (index 25.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 1 ms
    • linkedList.contains(item) ? TRUE 2 ms
  3. 5.000.000 UUIDs
    • SEARCHED ITEM : db568900-c874-46ba-9b44-0e1916420120 (index 2.500.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 33 ms
    • linkedList.contains(item) ? TRUE 65 ms

Based on above results, there is NOT a BIG difference of using array list vs set. Perhaps you can try to modify this code and replace the String with your Object and see the differences then...

    public static void main(String[] args) {
        Set<String> hashSet = new HashSet<>();
        Set<String> treeSet = new TreeSet<>();
        List<String> arrayList = new ArrayList<>();
        List<String> linkedList = new LinkedList<>();

        List<String> base = new ArrayList<>();

        for(int i = 0; i<5000000; i++){
            if(i%100000==0) System.out.print(".");
            base.add(UUID.randomUUID().toString());
        }

        System.out.println("\nBase size : " + base.size());
        String item = base.get(25000);
        System.out.println("SEARCHED ITEM : " + item);

        hashSet.addAll(base);
        treeSet.addAll(base);
        arrayList.addAll(base);
        linkedList.addAll(base);

        long ms = System.currentTimeMillis();
        System.out.println("hashSet.contains(item) ? " + (hashSet.contains(item)? "TRUE " : "FALSE") + (System.currentTimeMillis()-ms) + " ms");
        System.out.println("treeSet.contains(item) ? " + (treeSet.contains(item)? "TRUE " : "FALSE") + (System.currentTimeMillis()-ms) + " ms");
        System.out.println("arrayList.contains(item) ? " + (arrayList.contains(item)? "TRUE " : "FALSE") + (System.currentTimeMillis()-ms) + " ms");
        System.out.println("linkedList.contains(item) ? " + (linkedList.contains(item)? "TRUE " : "FALSE") + (System.currentTimeMillis()-ms) + " ms");
    }

Upvotes: 9

Prateek Paranjpe
Prateek Paranjpe

Reputation: 543

You could add elements to the list itself. Then, to dedup -

HashSet<String> hs = new HashSet<>(); // new hashset
hs.addAll(list); // add all list elements to hashset (this is the dedup, since addAll works as a union, thus removing all duplicates)
list.clear(); // clear the list
list.addAll(hs); // add all hashset elements to the list

If you just need a set with dedup, you can also use the addAll() on a different set, so that it will only have unique values.

Upvotes: 1

YoungHobbit
YoungHobbit

Reputation: 13402

The ArrayList uses an array for storing the data. The ArrayList.contains will be of O(n) complexity. So essentially searching in array again and again will have O(n^2) complexity.

While HashSet uses hashing mechanism for storing the elements into their respective buckets. The operation of HashSet will be faster for long list of values. It will reach the element in O(1).

Upvotes: 18

Peter Lawrey
Peter Lawrey

Reputation: 533880

If you don't need a list, I would just use a Set and this is the natural collection to use if order doesn't matter and you want to ignore duplicates.

You can do both is you need a List without duplicates.

private Set<String> set = new HashSet<>();
private List<String> list = new ArrayList<>();


public void add(String str) {
    if (set.add(str))
        list.add(str);
}

This way the list will only contain unique values, the original insertion order is preserved and the operation is O(1).

Upvotes: 6

Related Questions