Reputation: 13659
There's a productivity tool being used in our department. Basically, it extracts data from multiple Excel files, performs some data transformations, and exports the corresponding output as text files.
I managed to get a copy of the source code and investigated how the text files are generated. I found out that the developer created multiple BackgroundWorkers, one for each report to be generated. It looks like this:
bgWorkerGenerateTextReport_1.RunWorkerAsync(); // inside the doWork method, it calls the actual method that generates the text file
bgWorkerGenerateTextReport_2.RunWorkerAsync();
bgWorkerGenerateTextReport_3.RunWorkerAsync();
bgWorkerGenerateTextReport_4.RunWorkerAsync();
bgWorkerGenerateTextReport_5.RunWorkerAsync();
bgWorkerGenerateTextReport_6.RunWorkerAsync();
// more bgWorkers follow...
When each BackgroundWorker completes, it makes the corresponding LinkLabel visible so the user can click it to open the generated text file.
Some of the generated text files are very large (some contain almost a million rows and around 200 columns). Since I have access to the source code, I want to improve the tool.
First, I want to know what a better way to generate the text reports in parallel would be, compared to declaring multiple BackgroundWorkers. I know the original approach works, but I'm wondering whether there is a more elegant and proper approach.
I tried calling the methods that generate the different reports directly, but the UI became unresponsive while processing the files.
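For reference, the usual modern replacement for a per-report BackgroundWorker is `Task.Run` with `async`/`await`: the work runs on a thread-pool thread, and execution returns to the UI thread after the `await`, so controls can be updated safely. A minimal sketch, assuming a WinForms form on .NET 4.5 or later (the names `btnGenerate_Click`, `linkLblReport1`, and `GenerateTextReport1` are hypothetical stand-ins for the tool's actual members):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Forms;

public partial class ReportForm : Form
{
    private async void btnGenerate_Click(object sender, EventArgs e)
    {
        // Runs the report generation on a thread-pool thread,
        // keeping the UI responsive (this replaces one BackgroundWorker).
        string outputPath = await Task.Run(() => GenerateTextReport1());

        // After the await we are back on the UI thread,
        // so it is safe to touch controls here (this replaces
        // the RunWorkerCompleted handler).
        linkLblReport1.Text = outputPath;
        linkLblReport1.Visible = true;
    }

    // Hypothetical placeholder for the actual report-generation method.
    private string GenerateTextReport1()
    {
        string path = Path.Combine(Path.GetTempPath(), "report1.txt");
        File.WriteAllText(path, "report contents");
        return path;
    }
}
```

Several reports can be started the same way (one `Task.Run` each, or `Task.WhenAll` over a collection of them) without the boilerplate of declaring a BackgroundWorker per report.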
Upvotes: 0
Views: 783
Reputation: 150228
Before attempting any optimization, I would benchmark the current IO and CPU utilization throughout the entire runtime of the operation. If either is close to saturation the whole time, no other tuning is likely to have a significant impact.
To make this process work as fast as possible (if that's the goal), you want to optimize the use of each resource involved. The downside of doing that is, anything else running on the same hardware may experience significant delays.
When doing this type of processing, I tend to use a Producer/Consumer pattern.
You might investigate having a multi-threaded producer that reads the files, feeding the data to a multi-threaded consumer to do the processing. You would then have a multi-threaded consumer of the processed data to write the results.
Read Data -> Transform Data -> Write Data
The number of threads in each layer should be tuned based on performance measurements. This allows you to tune your data transformation pipeline to make optimal use of available IO and CPU resources.
Channels are often (but not always) the best choice in .NET to create this type of pipeline.
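A minimal sketch of such a pipeline using `System.Threading.Channels` (the channel API is real; `inputFiles`, `ReadRows`, `TransformRow`, and `WriteRow` are hypothetical placeholders for your own reading, transformation, and output code, and the capacities and thread count are values you would tune from measurements):

```csharp
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// Bounded channels apply back-pressure: a fast reader cannot
// run arbitrarily far ahead of a slow transformer or writer.
var rawRows = Channel.CreateBounded<string>(1000);
var transformedRows = Channel.CreateBounded<string>(1000);

// Producer: read the input files and push raw rows.
var reader = Task.Run(async () =>
{
    foreach (var file in inputFiles)
        foreach (var row in ReadRows(file))
            await rawRows.Writer.WriteAsync(row);
    rawRows.Writer.Complete();
});

// Transform stage: several workers consume raw rows
// and produce transformed rows. Tune the count (here 4)
// against measured CPU utilization.
var transformers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
{
    await foreach (var row in rawRows.Reader.ReadAllAsync())
        await transformedRows.Writer.WriteAsync(TransformRow(row));
})).ToArray();

// Writer: a single consumer appends rows to the output text file,
// avoiding contention on the file handle.
var writer = Task.Run(async () =>
{
    await foreach (var row in transformedRows.Reader.ReadAllAsync())
        WriteRow(row);
});

await reader;
await Task.WhenAll(transformers);
transformedRows.Writer.Complete();
await writer;
```

Each stage can then be widened or narrowed independently, which is the tuning knob described above.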
Upvotes: 2