Igor222

Reputation: 33

Multithreaded search of single collection for duplicates

Because I can't just divide the collection into segments. As in my example above, if 5 threads are set, the first segment would take the first 2 objects and the second segment would take the 3rd and 4th, so neither of them finds duplicates on its own, but there are duplicates if we merge them: the 2nd and the 3rd objects.

There could be a more complex strategy that takes results from the first threads... ah, never mind, it's too hard to explain.

And of course, no modification of the collection itself is in my plans.

Thanks.

EDIT:

In the end I take a chunk, and then continue analyzing that chunk till the end. ;/

Upvotes: 2

Views: 403

Answers (2)

Gray

Reputation: 116878

I think the process of dividing up the items to be de-duped is going to have to look at the end of the section and move forward to encompass dups past it. For example, if you had:

1  1  2 . 2  4  4 . 5  5  6

If you were dividing it up into blocks of 3, then the dividing process would take 1 1 2 but see that there was another 2, so it would generate 1 1 2 2 as the first block. It would move forward 3 again and generate 4 4 5, but see that there was a duplicate ahead and so generate 4 4 5 5. The 3rd thread would just have 6. The division would become:

1  1  2  2 . 4  4  5  5 . 6

The sizes of the blocks are going to be inconsistent, but as the number of items in the entire list gets large, these small differences become insignificant. The last thread may have very little to do or be short-changed altogether, but again, as the number of elements gets large, this should not impact the performance of the algorithm.

I think this method would be better than somehow having one thread handle the overlapping blocks. With that method, if you had a lot of dups, a single thread could end up handling a lot more than 2 contiguous blocks if you were unlucky in the positioning of the dups. For example:

1  1  2 . 2  4  5 . 5  5  6

One thread would have to handle that entire list because of the 2s and the 5s.
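
Here is a minimal Java sketch of that dividing process, assuming the collection is sorted so duplicates sit next to each other (as in the examples above); the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {

    // Split a sorted list into chunks of roughly targetSize elements,
    // pushing each boundary forward so that a run of equal elements
    // never straddles two chunks. (Assumes duplicates are adjacent.)
    static <T> List<List<T>> split(List<T> sorted, int targetSize) {
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        while (start < sorted.size()) {
            int end = Math.min(start + targetSize, sorted.size());
            // Extend the chunk while the next element duplicates the last one.
            while (end < sorted.size() && sorted.get(end).equals(sorted.get(end - 1))) {
                end++;
            }
            chunks.add(sorted.subList(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 1, 2, 2, 4, 4, 5, 5, 6);
        System.out.println(split(items, 3)); // [[1, 1, 2, 2], [4, 4, 5, 5], [6]]
    }
}

Each resulting chunk can then be handed to its own thread, and no duplicate pair can span two chunks.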

Upvotes: 4

Tudor

Reputation: 62439

I would use a chunk-based division, a task queue (e.g. ExecutorService) and private hash tables to collect duplicates.

Each thread in the pool will take chunks on demand from the queue and, for each item, add 1 to the value stored under that item's key in its private hash table. At the end, each thread will merge its private table into the global hash table.

Finally, just scan the global hash table and see which keys have a value greater than 1.

For example with a chunk size of 3 and the items:

1 2 2 2 3 4 5 5 6 6

Assume we have 2 threads in the pool. Thread 1 will take 1 2 2 and thread 2 will take 2 3 4. The private hash tables will look like (key, count):

1 1
2 2
3 0
4 0
5 0
6 0

and

1 0
2 1
3 1
4 1
5 0
6 0

Next, thread 1 will process 5 5 6 and thread 2 will process 6:

1 1
2 2
3 0
4 0
5 2
6 1  

and

1 0
2 1
3 1
4 1
5 0
6 1

At the end, the duplicates are 2, 5 and 6:

1 1
2 3
3 1
4 1
5 2
6 2

This takes up extra space for the private table of each thread, but it allows the threads to operate fully in parallel until the merge phase at the end.
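
A runnable sketch of this scheme, assuming a fixed-size pool from ExecutorService and a shared chunk queue; the class name ChunkedDupCounter and the chunk/thread counts are illustrative, not from the original answer:

import java.util.*;
import java.util.concurrent.*;

public class ChunkedDupCounter {

    // Count occurrences of each item with a pool of worker threads.
    // Each worker pulls chunks on demand, counts into a private HashMap,
    // and merges into the shared map once at the end.
    static <T> Map<T, Integer> countAll(List<T> items, int chunkSize, int threads)
            throws InterruptedException {
        // Queue of chunks the workers take from on demand.
        Queue<List<T>> chunkQueue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunkQueue.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }

        ConcurrentHashMap<T, Integer> global = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                Map<T, Integer> local = new HashMap<>(); // private hash table
                List<T> chunk;
                while ((chunk = chunkQueue.poll()) != null) {
                    for (T item : chunk) {
                        local.merge(item, 1, Integer::sum);
                    }
                }
                // Merge phase: fold the private counts into the global table.
                local.forEach((k, v) -> global.merge(k, v, Integer::sum));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return global;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> items = List.of(1, 2, 2, 2, 3, 4, 5, 5, 6, 6);
        Map<Integer, Integer> counts = countAll(items, 3, 2);
        counts.forEach((k, v) -> {
            if (v > 1) System.out.println("duplicate: " + k + " (x" + v + ")");
        });
    }
}

Pulling chunks from a ConcurrentLinkedQueue means faster threads naturally take more chunks, and the only synchronization cost is the single merge per thread at the end.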

Upvotes: 2
