Reputation: 75
I have 40 million records in MongoDB. I read that data in parallel from one collection, process it, and write it to another collection.
Sample code for job initialization:
    ExecutorService executor = Executors.newFixedThreadPool(10);
    // Placeholder: total number of records in the reading collection
    int count = totalNumberOfRecords;
    int pageSize = 5000;
    // Number of pages, rounded up so a partial last page is included
    int counter = (int) ((count % pageSize == 0) ? (count / pageSize) : (count / pageSize + 1));
    for (int i = 1; i <= counter; i++) {
        // One task per page; i is the 1-based page number
        Runnable worker = new FinalParallelDataProcessingStrategyOperator(mongoDatabase, vendor, version, importDate, vendorId, i, securitiesId);
        executor.execute(worker);
    }
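For reference, the page-count arithmetic above can be written as a single ceiling division. This is only a small sketch: the class and method names (`PageMath`, `pageCount`) are made up for illustration and are not part of the original job.

```java
final class PageMath {
    // Number of pages needed to cover `count` records: ceil(count / pageSize).
    static int pageCount(long count, int pageSize) {
        if (count < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("count must be >= 0 and pageSize > 0");
        }
        // Adding pageSize - 1 before dividing rounds up without floating point.
        return Math.toIntExact((count + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        // The test subset from the question: 319568 records, 5000 per page.
        System.out.println(pageCount(319_568L, 5_000));
    }
}
```

With the numbers from the question, the 319568-record subset needs 64 pages and the full 40 million records need 8000, so each run queues that many tasks on the pool.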
Each thread does the following:
    public void run() {
        try {
            List<SecurityTemp> temps = loadDataInBatch();
            populateToNewCollection(temps);
            populateToAnotherCollection(temps);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
Data loading is paginated using the following query:
    mongoDB.getCollection("reading_collection").find(whereClause)
        .skip(pagesize*(n-1)).limit(pagesize).batchSize(1000).iterator();
Machine configuration: 2 CPUs with 1 core each.
The parallel implementation gives almost the same performance as the sequential one. Stats on a subset of the data (319568 records):
    No. of threads    Execution time (minutes)
    1                 16
    3                 15
    8                 17
    10                17
    15                16
    20                12
    50                30
How can I improve the performance of this application?
Upvotes: 1
Views: 2069
Reputation: 38910
Multithreading does not automatically improve performance as the number of threads increases.
IO-bound applications won't gain much from multithreading.
It depends on a lot of factors. Refer to this related SE question:
Is multithreading faster than single thread?
Even for CPU-intensive applications that are less IO-bound, don't configure a huge number of threads in the hope of improving performance.
You can change your code as follows:
    ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
Or use a ForkJoinPool-based pool, as below (available from JDK 1.8 onwards):
    ExecutorService executor = Executors.newWorkStealingPool();
Executors API:
public static ExecutorService newWorkStealingPool()
Creates a work-stealing thread pool using all available processors as its target parallelism level.
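As a rough illustration of sizing the pool to the machine, here is a self-contained sketch that runs trivial stand-in tasks on a pool with one thread per available processor and then shuts the pool down. The class name `PoolSizingSketch`, the method `runOnCoreSizedPool`, and the dummy task (doubling a page number) are assumptions for the example; they stand in for the real per-page worker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PoolSizingSketch {
    // Runs `tasks` trivial jobs on a pool sized to the available processors
    // and returns the sum of their results.
    static long runOnCoreSizedPool(int tasks) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(cores);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 1; i <= tasks; i++) {
                final long page = i;
                jobs.add(() -> page * 2); // stand-in for real per-page work
            }
            long sum = 0;
            // invokeAll blocks until every job has completed.
            for (Future<Long> result : executor.invokeAll(jobs)) {
                sum += result.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown(); // otherwise the pool threads keep the JVM alive
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnCoreSizedPool(64));
    }
}
```

On the 2-core machine from the question this would create two worker threads. Swapping in `Executors.newWorkStealingPool()` needs no other change, since both return an `ExecutorService`.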
Upvotes: 2
Reputation: 11911
Since you are reading your input data from a single source, that part is most likely IO-bound (from the perspective of your application), so executing it in parallel will not gain you much. On the contrary, I think executing a similar query (just with different pagination) in parallel on multiple threads will have a negative performance impact: the same work has to be done multiple times on the DB, and the parallel queries might get in each other's way.
Another question is whether the processing part takes up a significant amount of time compared with reading the input. If it doesn't, parallel processing will not help much to speed things up. If it does, I suggest the following:
As for the number of threads: the "sweet spot" for minimum processing time depends on the kind of processing. For CPU-intensive tasks without much IO, it will most likely be around the number of available cores, in your case 2.
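To put a rough number on "the same work has to be done multiple times", the sketch below counts how many documents the server would walk if every page is fetched with `skip(offset).limit(pageSize)`. It assumes each such query has to pass over its skipped documents before returning its page, which is how MongoDB describes `skip` in general; the actual cost depends on indexes and the query plan. The names `SkipCost` and `documentsWalked` are invented for the example.

```java
final class SkipCost {
    // Total documents walked across all pages, under the assumption that a
    // query with skip(offset) passes over `offset` documents before it
    // returns its own page.
    static long documentsWalked(long count, int pageSize) {
        long walked = 0;
        for (long offset = 0; offset < count; offset += pageSize) {
            long pageLength = Math.min(pageSize, count - offset);
            walked += offset + pageLength; // skipped documents plus returned ones
        }
        return walked;
    }

    public static void main(String[] args) {
        // The test subset from the question: 319568 records, 5000 per page.
        System.out.println(documentsWalked(319_568L, 5_000));
    }
}
```

Under that assumption, paging through the 319568-record subset walks about 10.4 million documents to return roughly 320 thousand, and the overhead grows quadratically with the number of pages.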
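One arrangement consistent with the reasoning above (not necessarily what this answer had in mind) is to read the input once, sequentially, and hand batches to a worker pool sized to the number of cores. The sketch below simulates that in memory: the integer "records", the `process` method, and the class name `SingleReaderPipeline` are all hypothetical stand-ins for the real cursor and processing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class SingleReaderPipeline {
    // Hypothetical stand-in for the CPU-bound processing of one record.
    static int process(int record) {
        return record * 2;
    }

    // Reads `totalRecords` simulated records on the calling thread, hands them
    // to the workers in batches, and returns how many records were processed.
    static long run(int totalRecords, int batchSize) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService workers = Executors.newFixedThreadPool(cores);
        AtomicLong processed = new AtomicLong();
        List<Integer> batch = new ArrayList<>(batchSize);
        // Single sequential pass over the input, standing in for one cursor.
        for (int record = 0; record < totalRecords; record++) {
            batch.add(record);
            if (batch.size() == batchSize) {
                submit(workers, batch, processed);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            submit(workers, batch, processed); // partial last batch
        }
        workers.shutdown();
        try {
            if (!workers.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed.get();
    }

    private static void submit(ExecutorService workers, List<Integer> batch, AtomicLong processed) {
        workers.execute(() -> {
            for (int record : batch) {
                process(record);
                processed.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(run(12_345, 1_000));
    }
}
```

Note that `newFixedThreadPool` queues submitted batches without limit, so a real job with 40 million records would likely need some form of backpressure so the reader cannot run far ahead of the workers.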
Upvotes: 5