Reputation: 8486

What's a good strategy for processing a queue in parallel?

I'm writing a program which needs to recursively search through a folder structure, and would like to do so in parallel with several threads.

I've written the rather trivial synchronous method already: add the root directory to the queue, then dequeue a directory, queue its subdirectories, and so on until the queue is empty. I'll use a ConcurrentQueue<T> for my queue, but I have already realized that my loops will stop prematurely. The first thread will dequeue the root directory, and immediately every other thread could see that the queue is empty and exit, leaving the first thread as the only one running. I would like each thread to loop until the queue is empty, then wait until another thread queues some more directories, and keep going. I need some sort of checkpoint in my loop so that none of the threads will exit until every thread has reached the end of the loop, but I'm not sure of the best way to do this without deadlocking when there really are no more directories to process.

Upvotes: 3

Answers (3)

Theo Lenndorff

Reputation: 4592

If you want to stick to the concept of an explicit queue, have a look at the BlockingCollection<T> class. Its GetConsumingEnumerable() method returns an IEnumerable<T> that blocks when the collection has run out of items and continues as soon as new items are available. This means a thread is blocked whenever the collection is empty, which prevents it from stopping prematurely. The enumeration only ends once CompleteAdding() has been called and the collection is empty.

However, this is mainly useful for producer-consumer scenarios, and I am not sure whether your problem falls into this category.
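A minimal sketch of how this could look for the directory search, assuming C# 8 or later. The names QueueWalker, CountDirectories and the pending counter are made up for illustration and are not part of BlockingCollection. The counter tracks directories that have been queued but not fully processed; when it drops to zero, no thread can add anything more, so it is safe to call CompleteAdding() and let every worker's loop end.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class QueueWalker
{
    // Visits every readable directory under root using workerCount consumers
    // and returns how many directories were visited.
    public static int CountDirectories(string root, int workerCount)
    {
        using var queue = new BlockingCollection<string>();
        int pending = 1;   // directories queued but not yet fully processed (the root counts)
        int visited = 0;

        queue.Add(root);

        void Worker()
        {
            // Blocks while the queue is empty; ends after CompleteAdding().
            foreach (string dir in queue.GetConsumingEnumerable())
            {
                try
                {
                    foreach (string sub in Directory.EnumerateDirectories(dir))
                    {
                        // Count the child before queuing it so pending
                        // cannot reach zero while work is still outstanding.
                        Interlocked.Increment(ref pending);
                        queue.Add(sub);
                    }
                    Interlocked.Increment(ref visited);
                }
                catch (UnauthorizedAccessException) { /* skip unreadable directory */ }
                catch (IOException) { /* skip directory that vanished or failed */ }
                finally
                {
                    // This directory is done; the last one to finish releases everyone.
                    if (Interlocked.Decrement(ref pending) == 0)
                        queue.CompleteAdding();
                }
            }
        }

        Task[] workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Factory.StartNew(() => Worker(), TaskCreationOptions.LongRunning))
            .ToArray();

        Task.WaitAll(workers);
        return visited;
    }
}
```

Calling QueueWalker.CountDirectories(@"C:\some\root", 4) would then search with four workers; the path and the worker count are placeholders.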

Upvotes: 2

Richard

Reputation: 109005

Use the Task Parallel Library.

Create a Task to process the first folder. Within it, create a Task to process each subfolder (recursively) and a Task for each relevant file. Then wait on all the tasks for this folder.

The TPL runtime uses the thread pool, which avoids creating threads, an expensive operation for small pieces of work.

Note:

If the work per file is trivial do it inline rather than creating another task (IO performance will be the limiting factor).
This approach will generally work best if blocking operations are avoided, but if IO performance is the limit then this might not matter anyway—start simple and measure.
Before .NET 4 much of this can be done with the thread pool, but you'll need to use events to wait for tasks to complete, and that waiting will tie up thread pool threads.¹

¹ As I understand it, when you wait on tasks using a TPL method, the TPL will reuse that thread for other tasks until the wait is fulfilled.
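A rough sketch of this approach, counting files as a stand-in for the real per-file work. TaskWalker and CountFiles are hypothetical names, and Task.Run assumes .NET 4.5 or later (on .NET 4, Task.Factory.StartNew would be the equivalent). The per-file work is trivial here, so it is done inline as suggested in the note above, and only subfolders get their own task.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class TaskWalker
{
    // Returns the number of files under folder, using one task per subfolder.
    public static int CountFiles(string folder)
    {
        int count;
        string[] subfolders;
        try
        {
            // Trivial per-file work (counting) is done inline.
            count = Directory.GetFiles(folder).Length;
            subfolders = Directory.GetDirectories(folder);
        }
        catch (UnauthorizedAccessException) { return 0; }  // skip unreadable folder
        catch (IOException) { return 0; }                  // skip folder that failed to list

        // One child task per subfolder; each child recurses the same way.
        Task<int>[] children = subfolders
            .Select(sub => Task.Run(() => CountFiles(sub)))
            .ToArray();

        // Wait on all the tasks for this folder, then combine their results.
        Task.WaitAll(children);
        return count + children.Sum(t => t.Result);
    }
}
```

There is no shared queue and no exit condition to get wrong here: a folder is finished exactly when its own tasks are finished, so the call on the root returns only after the whole tree has been processed.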

Upvotes: 5

IAmTimCorey

Reputation: 16757

It seems that your best bet in this case is to start with one thread and then, whenever you load subdirectories, queue work items to the thread pool to handle them. Let each work item finish when it is done, and queue new ones to the pool every time you go one level deeper into the directories. This way there is no deadlock and your system uses threads as it needs them. You could even decide how many work items to start based on how many folders were found.

Edit: Changed the above to be more clear that you don't want to explicitly create new threads but instead you want to take advantage of the thread pool to add and remove threads as needed without the overhead.
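One way this could be sketched, assuming C# 8 or later. PoolWalker, CountDirectories and the pending counter are illustrative names, not framework APIs. Each directory becomes one thread pool work item, and a counter of outstanding items tells the caller when the whole tree is done, so no worker ever has to wait for another.

```csharp
using System;
using System.IO;
using System.Threading;

static class PoolWalker
{
    // Visits every readable directory under root on the thread pool
    // and returns how many directories were visited.
    public static int CountDirectories(string root)
    {
        int pending = 1;   // work items queued but not yet finished (the root counts)
        int visited = 0;
        using var done = new ManualResetEventSlim(false);

        void Visit(string dir)
        {
            try
            {
                foreach (string sub in Directory.EnumerateDirectories(dir))
                {
                    // Count the child before queuing it so pending
                    // cannot reach zero while work is still outstanding.
                    Interlocked.Increment(ref pending);
                    ThreadPool.QueueUserWorkItem(_ => Visit(sub));
                }
                Interlocked.Increment(ref visited);
            }
            catch (UnauthorizedAccessException) { /* skip unreadable directory */ }
            catch (IOException) { /* skip directory that vanished or failed */ }
            finally
            {
                // The last work item to finish signals the caller.
                if (Interlocked.Decrement(ref pending) == 0)
                    done.Set();
            }
        }

        ThreadPool.QueueUserWorkItem(_ => Visit(root));
        done.Wait();   // only the calling thread blocks; pool threads never wait
        return visited;
    }
}
```

The pool decides how many threads actually run at once, so the number of queued work items can far exceed the number of threads without creating one thread per folder.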

Upvotes: 1
