Reputation: 13
My program reads text files line by line, extracting specific types of words from every line (it matters which line each word was found in). Which would be better: splitting the work across threads by file (each thread reads a different file) or by line (each thread reads a different line from the same file)?
Upvotes: 1
Views: 220
Reputation: 50063
As always with performance questions, you should probably try both and measure, if feasible. But here is what my intuition says:
If the files have similar size / take similar time to process, giving each thread their own file is probably best.
Many threads accessing one file is probably only worth it if the computation time dominates the file IO time.
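For illustration, the one-thread-per-file variant could look roughly like the sketch below. The question doesn't name a language, so Python is an assumption here, as are the names `scan_file`, `scan_files` and `WORD_PATTERN`; the regex is only a placeholder for whatever "specific types of words" means in your program.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Placeholder for the "specific types of words" (assumption: capitalised words).
WORD_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")

def scan_file(path):
    """Return (path, line_number, word) for every match, in file order."""
    hits = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            for word in WORD_PATTERN.findall(line):
                hits.append((path, line_number, word))
    return hits

def scan_files(paths, max_workers=4):
    """Scan each file as its own task; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields per-file results in input order, not completion order.
        per_file = pool.map(scan_file, paths)
        return [hit for hits in per_file for hit in hits]
```

Each task owns its file, so no locking is needed and line numbers come for free. Note that in standard CPython builds the GIL limits how much the regex work itself runs in parallel, so threads mostly help overlap the IO; for computation-heavy extraction, a process pool or another language may be a better fit.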
But again, you should measure. Guessing about performance goes wrong often enough. As @Jerry Coffin points out, it is quite possible that neither approach will help you. On the other hand, the files may already be pre-loaded into RAM, in which case his point may not apply, or at least not to its full extent. Really, just try and see. This is a wide field and hard to predict.
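To make "measure" concrete, a minimal timing helper might look like this (Python is an assumption, and `time_call` is a made-up name). It runs a function a few times and keeps the best wall-clock time, which smooths out some noise.

```python
import time

def time_call(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result
```

You would call it once with the per-file variant and once with the per-line variant on the same input. Keep in mind that the first run may read from disk while later runs hit the OS cache, so the best-of-N figure mostly reflects the cached case.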
Upvotes: 3
Reputation: 490138
Unless you have multiple hard drives, probably neither.
The hard drive is inherently single-threaded--that is, it produces only a single stream of data at any given time. With an actual hard drive with a spinning disc and a head that seeks around the disc, your best throughput will usually come from reading sequentially. Seeking around in the file or between separate files to different spots can reduce throughput substantially.
If you do have multiple drives, then it'll depend on how your data is distributed across the drives, but ideally you'd probably want something like one thread dedicated to reading data from each physical drive.
If you have sufficient processing to do on the data once it's read, you can have a single thread reading the data, and putting that data into some sort of thread-safe queue. From there you have processing threads that take individual data items, process them, and write the result to...wherever you want your output.
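A rough sketch of that single-reader layout, assuming Python and a placeholder regex for the word extraction (`reader`, `worker`, `run_pipeline`, `STOP` and `WORD_PATTERN` are names invented for this example):

```python
import queue
import re
import threading

# Placeholder for the real extraction rule (assumption: capitalised words).
WORD_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")
STOP = object()  # sentinel telling a worker there is no more input

def reader(paths, work_queue, worker_count):
    """Read every file sequentially and queue (path, line_number, line)."""
    try:
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                for line_number, line in enumerate(handle, start=1):
                    work_queue.put((path, line_number, line))
    finally:
        # Always release the workers, even if a file could not be read.
        for _ in range(worker_count):
            work_queue.put(STOP)

def worker(work_queue, results, results_lock):
    """Take lines off the queue and record the words found in each."""
    while True:
        item = work_queue.get()
        if item is STOP:
            break
        path, line_number, line = item
        words = WORD_PATTERN.findall(line)
        if words:
            with results_lock:
                results.append((path, line_number, words))

def run_pipeline(paths, worker_count=4, queue_size=1000):
    work_queue = queue.Queue(maxsize=queue_size)  # bounded, so the reader cannot run far ahead
    results = []
    results_lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(work_queue, results, results_lock))
        for _ in range(worker_count)
    ]
    for thread in workers:
        thread.start()
    reader(paths, work_queue, worker_count)  # the calling thread is the only reader
    for thread in workers:
        thread.join()
    return sorted(results)  # workers finish in arbitrary order
```

The line number travels with each queued item, so it survives even though the lines are processed out of order. Queueing single lines is simple but has per-item overhead; batching several lines per item is a common refinement.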
If that's going back to a file (or multiple files) you probably want more or less the reverse here: a single thread to write output to each result disc, and the processing threads deposit their data in some sort of queue. In a typical case, that'll be a priority queue ordered by the order in which the data should be written to the output file, so the output thread always writes the data sequentially.
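The ordered-output side could be sketched like this, again assuming Python. The assumption is that each result is tagged with a sequence number starting at 0 with no gaps; `ordered_writer` and `STOP` are hypothetical names, and `result_queue` can be any object with a blocking `get()`, such as a `queue.Queue`.

```python
import heapq

STOP = object()  # sentinel: no more results will arrive

def ordered_writer(result_queue, output):
    """Write (sequence_number, text) items to `output` in sequence order."""
    pending = []   # min-heap of results that arrived early
    next_seq = 0   # sequence number the output is waiting for
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        heapq.heappush(pending, item)
        # Flush everything that is now contiguous with what was already written.
        while pending and pending[0][0] == next_seq:
            _, text = heapq.heappop(pending)
            output.write(text)
            next_seq += 1
```

The heap only holds results that arrived ahead of their turn, so memory use stays small as long as no single item lags far behind the rest.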
Upvotes: 3
Reputation: 89
Depends on how many files there are and how many lines there are per file.
If you have relatively few lines in each file then the parallelisation will not be worth the overhead. Same goes for if you're handling relatively few files.
Could always parallelise both.
Upvotes: 2