Igor222

Reputation: 33

Multithreaded search of single collection for duplicates

Because I can't just divide the collection into segments. As in my example above, if 5 threads are set, the first segment would take the first 2 objects and the second segment would take the 3rd and 4th, so neither of them finds duplicates on its own, but there are duplicates if we merge them: the 2nd and the 3rd objects.

There could be a more complex strategy that takes results from the first threads... ah, never mind, it's too hard to explain.

And of course, no modification of the collection itself is in my plans.

Thanks.

EDIT:

In the end I take a chunk, and then continue analyzing that chunk till the end. ;/

Upvotes: 2

Views: 403

Answers (2)

Gray

Reputation: 116878

I think the process of dividing up the items to be de-duped is going to have to look at the end of the section and move forward to encompass dups past it. For example, if you had:

1  1  2 . 2  4  4 . 5  5  6

If you were dividing it up into blocks of 3, then the dividing process would take 1 1 2 but see that there was another 2, so it would generate 1 1 2 2 as the first block. It would move forward 3 again and generate 4 4 5, but see that there was a duplicate ahead and so generate 4 4 5 5. The 3rd thread would just have 6. The division would become:

1  1  2  2 . 4  4  5  5 . 6

The sizes of the blocks are going to be inconsistent, but as the number of items in the entire list gets large, these small differences become insignificant. The last thread may have very little to do or be short-changed altogether, but again, as the number of elements gets large, this should not impact the performance of the algorithm.

I think this method would be better than somehow having one thread handle the overlapping blocks. With that method, if you had a lot of dups, a single thread could end up handling a lot more than 2 contiguous blocks if you were unlucky in the positioning of the dups. For example:

1  1  2 . 2  4  5 . 5  5  6

One thread would have to handle that entire list because of the 2s and the 5s.
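
Here is a minimal Java sketch of that dividing process, assuming the collection is sorted so duplicates sit next to each other (as in the examples above); the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {

    // Split a sorted list into chunks of roughly targetSize elements,
    // pushing each boundary forward so that a run of equal elements
    // never straddles two chunks. (Assumes duplicates are adjacent.)
    static <T> List<List<T>> split(List<T> sorted, int targetSize) {
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        while (start < sorted.size()) {
            int end = Math.min(start + targetSize, sorted.size());
            // Extend the chunk while the next element duplicates the last one.
            while (end < sorted.size() && sorted.get(end).equals(sorted.get(end - 1))) {
                end++;
            }
            chunks.add(sorted.subList(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 1, 2, 2, 4, 4, 5, 5, 6);
        System.out.println(split(items, 3)); // [[1, 1, 2, 2], [4, 4, 5, 5], [6]]
    }
}

Each resulting chunk can then be handed to its own thread, and no duplicate pair can span two chunks.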

Upvotes: 4

Tudor

Reputation: 62439

I would use a chunk-based division, a task queue (e.g. ExecutorService) and private hash tables to collect duplicates.

Each thread in the pool will take chunks on demand from the queue and, for each item, add 1 to the value stored under that item's key in its private hash table. At the end, each thread will merge its private table into the global hash table.

Finally, just scan the global hash table and see which keys have a value greater than 1.

For example with a chunk size of 3 and the items:

1 2 2 2 3 4 5 5 6 6

Assume we have 2 threads in the pool. Thread 1 will take 1 2 2 and thread 2 will take 2 3 4. The private hash tables will look like (key, count):

1 1
2 2
3 0
4 0
5 0
6 0

and

1 0
2 1
3 1
4 1
5 0
6 0

Next, thread 1 will process 5 5 6 and thread 2 will process 6:

1 1
2 2
3 0
4 0
5 2
6 1  

and

1 0
2 1
3 1
4 1
5 0
6 1

At the end, the duplicates are 2, 5 and 6:

1 1
2 3
3 1
4 1
5 2
6 2

This takes up extra space for the private table of each thread, but it allows the threads to operate fully in parallel until the merge phase at the end.
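
A runnable sketch of this scheme, assuming a fixed-size pool from ExecutorService and a shared chunk queue; the class name ChunkedDupCounter and the chunk/thread counts are illustrative, not from the original answer:

import java.util.*;
import java.util.concurrent.*;

public class ChunkedDupCounter {

    // Count occurrences of each item with a pool of worker threads.
    // Each worker pulls chunks on demand, counts into a private HashMap,
    // and merges into the shared map once at the end.
    static <T> Map<T, Integer> countAll(List<T> items, int chunkSize, int threads)
            throws InterruptedException {
        // Queue of chunks the workers take from on demand.
        Queue<List<T>> chunkQueue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunkQueue.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }

        ConcurrentHashMap<T, Integer> global = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                Map<T, Integer> local = new HashMap<>(); // private hash table
                List<T> chunk;
                while ((chunk = chunkQueue.poll()) != null) {
                    for (T item : chunk) {
                        local.merge(item, 1, Integer::sum);
                    }
                }
                // Merge phase: fold the private counts into the global table.
                local.forEach((k, v) -> global.merge(k, v, Integer::sum));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return global;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> items = List.of(1, 2, 2, 2, 3, 4, 5, 5, 6, 6);
        Map<Integer, Integer> counts = countAll(items, 3, 2);
        counts.forEach((k, v) -> {
            if (v > 1) System.out.println("duplicate: " + k + " (x" + v + ")");
        });
    }
}

Pulling chunks from a ConcurrentLinkedQueue means faster threads naturally take more chunks, and the only synchronization cost is the single merge per thread at the end.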

Upvotes: 2
