Reputation: 1281
I want to remove duplicated values in my data. I know this question comes up frequently on Stack Overflow, but my problem is a little different because I am handling a very large amount of data, so execution time is my main concern.
As shown in the snippet below, I wrote a simple piece of code for removing duplicated values.
// Suppose that the dataset is very huge so that
// multi-node resources should be necessary.
String[] data = new String[10_000_000];
HashMap<String, String> uniqueItems = new HashMap<>();
for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}
However, I am not happy with it, because I suspect other data structures or algorithms could remove duplicates more efficiently than my code does.
So I am looking for better ways to quickly remove duplicated values when the data is large.
I would appreciate it if you could let me know the fastest way to remove duplicated values.
I am also wondering whether the proportion of duplicated values affects performance. For example, if 50% of the original data is duplicated, would the best choice of algorithm and data structure change? If so, I want a way that achieves good performance in the general case.
Upvotes: 1
Views: 6386
Reputation: 19349
Convert your uniqueItems to a HashSet<String> and your for loop to simply:
uniqueItems.add(data[i]);
If add returns true then you have inserted a unique string; false if it was a duplicate.
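A minimal sketch of the whole loop, reusing the variable names from the question:
import java.util.HashSet;

String[] data = new String[10_000_000];
HashSet<String> uniqueItems = new HashSet<>();
for (int i = 0; i < data.length; i++) {
    // add() returns false when the element is already present,
    // so duplicates are ignored automatically.
    uniqueItems.add(data[i]);
}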
Both algorithms should run in O(n) time in the best case, but using a HashMap when you don't care about the value (for a given key) is silly and wastes resources. A HashSet is a better fit for such cases.
You could also try a TreeSet<String> to see what works best for your particular dataset. It will probably be worse, given JDK 8's new HashSet implementation: overcrowded buckets are automatically stored as mini tree sets, providing competitive performance even when the hashing function behaves badly. (This optimization is only possible for Comparable types such as String.)
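If you want to benchmark the two, only the declaration changes; a rough sketch, assuming data contains real (non-null) strings:
import java.util.TreeSet;

// TreeSet keeps elements sorted and rejects null elements.
TreeSet<String> uniqueItems = new TreeSet<>();
for (String s : data) {
    uniqueItems.add(s);  // O(log n) per insertion instead of amortized O(1)
}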
Brute force searching arrays. In a simple, array-based algorithm, where you search the entire array before inserting each element, you would get a very bad O(n²) performance.
As such, you might be tempted to sort your data first, placing duplicated elements near each other. This buys you a faster O(n log n) performance, but it is still behind the HashMap/HashSet version in the general case.
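For comparison, a sketch of the sort-then-scan approach, assuming data holds no null entries:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

Arrays.sort(data);                        // O(n log n)
List<String> unique = new ArrayList<>();
for (int i = 0; i < data.length; i++) {
    // After sorting, equal values are adjacent, so compare with the previous element.
    if (i == 0 || !data[i].equals(data[i - 1])) {
        unique.add(data[i]);
    }
}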
Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once, so our current O(n) time complexity is really the best you can do here.
Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.
Upvotes: 7
Reputation: 254
In your example the data[i] values are used as keys in the HashMap uniqueItems.
A HashMap always has unique keys; an existing key is simply overwritten by a put() operation. You don't need containsKey() if you just want to add a new element.
Why do you remove and re-insert an existing key?
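In other words, the loop body from the question can be reduced to a single call; a sketch reusing the question's variables:
for (int i = 0; i < data.length; i++) {
    // put() alone is enough: if the key already exists, its value is overwritten.
    uniqueItems.put(data[i], "inserted");
}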
Upvotes: 1