sclee1

Reputation: 1281

Fastest and most efficient way to remove duplicates in Java

I want to remove duplicated values in my data. I know this question comes up frequently on Stack Overflow, but my problem is a little different because I am handling a very large amount of data. Therefore execution time is my main concern.

As shown in the snippet below, I wrote some simple code for removing duplicated values.

// Suppose that the dataset is very huge so that
// multi-node resources should be necessary.    
String[] data = new String[10_000_000];

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}

However, I don't like it, because I think other data structures or different algorithms could remove duplicates more efficiently than my code does.

So I am looking for better ways to quickly remove duplicated values when the data is large.
I would appreciate it if you could let me know the fastest way of removing duplicated values.

I am also wondering whether the proportion of duplicated values affects performance. For example, if 50% of the original data is duplicated, would the best choice of algorithm and data structure change? If so, I want to find an approach that performs well in the general case.

Upvotes: 1

Views: 6386

Answers (2)

Paulo Mattos

Reputation: 19349

Convert your uniqueItems to a HashSet<String> and your for loop to simply:

uniqueItems.add(data[i]);

If add returns true, you have inserted a unique string; it returns false for a duplicate.
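
For example, the whole loop then collapses to something like this (a minimal sketch, assuming data is already filled with real values):

import java.util.HashSet;
import java.util.Set;

Set<String> uniqueItems = new HashSet<>();

for (String s : data) {
    // add() returns false for duplicates, so no containsKey()/remove() dance is needed
    uniqueItems.add(s);
}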

Both algorithms should run in O(n) time in the best case, but using a HashMap when you don't care about the value (for a given key) is silly and wastes resources. A HashSet is a better fit for such cases.

You could also try a TreeSet<String> to see what works best for your particular dataset. It will probably be worse, given the new JDK 8 HashSet implementation: overcrowded buckets are stored as mini tree sets automatically, providing competitive performance even when the hashing function behaves badly. (This optimization is only possible for Comparable types such as String.)
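
Since TreeSet implements the same Set interface, benchmarking it against HashSet is just a one-line change (a sketch, under the same assumptions as above):

import java.util.Set;
import java.util.TreeSet;

Set<String> uniqueItems = new TreeSet<>(); // O(log n) per add(), keeps the strings sorted

for (String s : data) {
    uniqueItems.add(s); // same Set API as HashSet
}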


Brute-force searching arrays. In a simple, array-based algorithm, where you search the entire array before inserting each element, you would get very bad O(n²) performance.

As such, you might be tempted to sort your data first, placing duplicated elements near each other. This buys you a faster O(n log n) performance, but still behind the HashMap/HashSet version in the general case.
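
For illustration only, a sort-then-scan version might look like this (a sketch, assuming data contains no null entries; not the recommended approach):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

String[] sorted = data.clone();
Arrays.sort(sorted); // O(n log n); duplicates end up adjacent

List<String> unique = new ArrayList<>();
for (int i = 0; i < sorted.length; i++) {
    // once the array is sorted, only the previous element needs to be checked
    if (i == 0 || !sorted[i].equals(sorted[i - 1])) {
        unique.add(sorted[i]);
    }
}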


Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once. As such, the current O(n) time complexity is really the best you can do here.

Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.

Upvotes: 7

Norbert

Reputation: 254

In your example the data[i] values are used as keys in the HashMap uniqueItems.

A HashMap always has unique keys. The value for an existing key is simply overwritten by a put() operation. You don't need containsKey() if you want to add a new element.

Why do you remove and re-insert an existing key?
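
In other words, the loop body could be reduced to a single put() per element, roughly like this (a sketch of the same idea):

import java.util.HashMap;
import java.util.Map;

Map<String, String> uniqueItems = new HashMap<>();

for (String s : data) {
    // put() replaces any existing mapping for s, so containsKey() and remove() are unnecessary
    uniqueItems.put(s, "inserted");
}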

Upvotes: 1
