Reputation: 189
Let's assume we have the following (very abstract) task:
We have a folder with a varying number of files to process (there may be 1, 2, or a few thousand files). Each file can only be processed sequentially (i.e. it is impossible to read a whole file into memory and process it in several threads). The result of processing a file should be a new file, also written sequentially. How can we do this using all available CPU cores?
I see only two approaches:
Use a task queue processed by several threads, where each task processes a single file: read a chunk from the file, process the chunk, write the chunk to the result file.
Use something like the Pipeline pattern. One input thread reads files asynchronously and posts chunks to several processing queues. Each processing thread reads from its own queue and processes chunks, then posts the results to an output queue. An output thread writes the result files. So we have one input (reading) thread, one output (writing) thread, and several processing threads.
Chunk processing is not a very fast operation; it is slower than reading.
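The second (pipeline) approach can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: `process_chunk` is a hypothetical stand-in for the real computation, the worker count and queue size are arbitrary, and in-memory chunk lists stand in for real files so the sketch stays self-contained. Note that the writer keys results by chunk index so the sequential output order survives out-of-order processing.

```python
import queue
import threading

NUM_WORKERS = 4  # assumption: roughly one per CPU core

def process_chunk(chunk):
    # stand-in for the real (slow) per-chunk computation
    return chunk.upper()

def run_pipeline(files):
    """files: dict of name -> list of chunks; returns the processed chunks."""
    work_q = queue.Queue(maxsize=64)  # bounded, so the reader cannot run far ahead
    out_q = queue.Queue()
    results = {name: [None] * len(chunks) for name, chunks in files.items()}

    def reader():
        for name, chunks in files.items():
            for i, chunk in enumerate(chunks):
                work_q.put((name, i, chunk))
        for _ in range(NUM_WORKERS):
            work_q.put(None)          # one shutdown sentinel per worker

    def worker():
        while True:
            item = work_q.get()
            if item is None:
                out_q.put(None)       # tell the writer this worker is done
                return
            name, i, chunk = item
            out_q.put((name, i, process_chunk(chunk)))

    def writer():
        finished = 0
        while finished < NUM_WORKERS:
            item = out_q.get()
            if item is None:
                finished += 1
                continue
            name, i, chunk = item
            results[name][i] = chunk  # index restores sequential output order

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```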
OS: Mac/Linux, maybe Windows.
Which approach is better? Do we have any other solutions?
Upvotes: 0
Views: 664
Reputation: 12364
There are certain advantages and disadvantages to both approaches: a single reader thread versus doing the read in each processing thread.
BTW, there are more possible processing schemes. One you forgot to mention is a single writer thread: your processing threads dump their results into a queue and let a background thread write them. This might give you an additional boost, since no processing thread has to wait for writes.
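The single-writer idea can be sketched like this (a toy example: an in-memory list stands in for the output files, and the chunk data is made up). Because only the writer thread touches the output, the processing threads never block on I/O and no lock is needed around the writes:

```python
import queue
import threading

write_q = queue.Queue()
STOP = object()                      # sentinel to shut the writer down

output = []                          # stands in for the output files

def writer():
    # Only this thread writes, so no locking is needed on `output`.
    while True:
        item = write_q.get()
        if item is STOP:
            return
        output.append(item)          # real code: write to the right result file

def worker(chunks):
    for chunk in chunks:
        write_q.put(chunk.upper())   # hand off the result instead of writing it

w = threading.Thread(target=writer)
w.start()
workers = [threading.Thread(target=worker, args=(["a", "b"],)) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
write_q.put(STOP)
w.join()
```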
You can also use parallel readers that write into one queue, from which the processing threads take their work (even more complicated programming :-), but it works in some cases.
Well, parallel writers could also work.
Also, you can think about distributing your files across different local disks (not directories, but physical disks). This would definitely increase your read/write performance if the disks are used in parallel.
Upvotes: 0
Reputation: 14289
The best approach would be to write a simple Task class that performs the whole operation (read, process, write) stand-alone, without any ties to external, thread-unsafe state. Then use a task queue from which a fixed number of threads fetch those tasks and process them. A good number of threads is usually cores * 2.
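A minimal sketch of this task-per-file scheme, assuming a made-up `process_file` transformation (uppercasing) and a hypothetical `.out` naming convention. In CPython, note that threads only run the processing in parallel if it releases the GIL (I/O or native code); for pure-Python CPU work you would swap `ThreadPoolExecutor` for `ProcessPoolExecutor` with the same interface:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_file(src):
    """Self-contained task: read, process, write, with no shared state."""
    dst = src + ".out"                # assumption: output naming convention
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:              # sequential read ...
            fout.write(line.upper())  # ... process, sequential write
    return dst

def process_folder(paths, workers=(os.cpu_count() or 1) * 2):
    # Fixed pool of threads pulling file-tasks from an internal queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```

Because each task owns its input and output file completely, no synchronization is needed between tasks.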
It can be mathematically proven that option 2 will always be equal to or slower than the task-based solution, and in all cases it will be more complicated. The only situation where option 2 is more viable is when thread switching becomes the actual bottleneck, e.g. a server with 1000 concurrent but stateful connections and only one network card: there it is more efficient to have one network thread feeding the 1000 processing threads than to wake up 1000 threads on every byte sent over the line.
The task-based solution also makes it much easier to measure throughput and to compare how additional threads affect it, as you can simply measure in tasks per second.
Upvotes: 1
Reputation: 179981
Probably the simplest efficient solution is to have a single reader thread running at lower-than-default priority. Whenever a CPU core is free, it gets to run, and it spawns a worker thread (which processes one input file and writes the result back). Since the workers run at default priority, the scheme self-balances: when all CPUs are busy processing files, the reader thread gets little CPU time, so not many new worker threads are created.
There's no real point in separating the processing of files from writing them back to disk; that just creates the possibility of a lot of unwritten work queuing up in memory.
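Thread priorities are not portable (and not exposed by Python's `threading` module), so the following sketch approximates the self-balancing behaviour with a counting semaphore instead: the reader blocks whenever as many workers as cores are already busy. In-memory strings stand in for real files, and `handle` is a made-up whole-file transformation:

```python
import os
import threading

slots = threading.Semaphore(os.cpu_count() or 1)  # one slot per CPU core

results = []
lock = threading.Lock()              # results list is shared by the workers

def handle(name, text):
    try:
        out = text.upper()           # process one whole file ...
        with lock:
            results.append((name, out))  # ... and "write it back"
    finally:
        slots.release()              # free the slot so the reader may continue

def reader(files):
    threads = []
    for name, text in files:
        slots.acquire()              # blocks while all cores are busy
        t = threading.Thread(target=handle, args=(name, text))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

reader([("a", "x"), ("b", "y")])
```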
Upvotes: 1