Java multithreading performance worsens as thread pool size increases

Question

I have 40 million records in MongoDB. I am reading that data in parallel from a collection, processing it, and writing it into another collection.

Sample code for job initialization.

ExecutorService executor = Executors.newFixedThreadPool(10);
int count = totalRecords; // placeholder: total number of records in the reading collection
int pageSize = 5000;
// Number of pages, rounded up so that a final partial page is included
int counter = (count % pageSize == 0) ? (count / pageSize) : (count / pageSize + 1);
for (int i = 1; i <= counter; i++) {
    // Each worker handles page i of the reading collection
    Runnable worker = new FinalParallelDataProcessingStrategyOperator(mongoDatabase, vendor, version, importDate, vendorId, i, securitiesId);
    executor.execute(worker);
}

Each worker does the following:

public void run() {
    try {
        List temps = loadDataInBatch();
        populateToNewCollection(temps);
        populateToAnotherCollection(temps);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The data load is paginated using the following query:

mongoDB.getCollection("reading_collection").find(whereClause)
            .skip(pagesize*(n-1)).limit(pagesize).batchSize(1000).iterator();

pagination code reference

Machine configuration: 2 CPUs with 1 core each

The parallel implementation gives almost the same performance as the sequential one. Stats on a subset of the data (319568 records):

No. of threads   Execution time (minutes)

   1                 16
   3                 15
   8                 17
   10                17
   15                16
   20                12
   50                30

How can I improve the performance of this application?

piet.t · Accepted Answer

Since you are reading your input data from a single source, that part is most likely IO-bound (from the perspective of your application), so executing it in parallel will not gain you much. On the contrary, I think executing a similar query (just with different pagination) in parallel on multiple threads will have a negative performance impact: the same work has to be done multiple times on the DB, and the parallel queries might get in each other's way.
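To put a rough number on "the same work has to be done multiple times": MongoDB's skip() has to walk past the skipped results before it can return anything, so page n pays for all the pages before it. The sketch below only does the arithmetic for the page size and record count from the question; the class and method names are made up for illustration, and whether the server walks documents or index entries depends on whereClause and the available indexes.

```java
public class SkipCost {
    // Number of pages needed to cover `count` records (ceiling division).
    static long pages(long count, long pageSize) {
        return (count + pageSize - 1) / pageSize;
    }

    // Total results walked past across all pages, assuming page n (1-based)
    // issues skip(pageSize * (n - 1)) as in the question's query.
    static long totalSkipped(long count, long pageSize) {
        long p = pages(count, pageSize);
        // 0 + pageSize + 2*pageSize + ... + (p-1)*pageSize
        return pageSize * (p * (p - 1) / 2);
    }

    public static void main(String[] args) {
        long pageSize = 5_000;
        // Subset from the question: 64 pages, 10,080,000 results skipped
        System.out.println(pages(319_568, pageSize) + " pages, "
                + totalSkipped(319_568, pageSize) + " skipped");
        // Full data set: 8000 pages, 159,980,000,000 results skipped
        System.out.println(pages(40_000_000, pageSize) + " pages, "
                + totalSkipped(40_000_000, pageSize) + " skipped");
    }
}
```

So for the 319568-record subset the paginated approach skips roughly 30 times more results than it actually returns, and the ratio grows with the size of the collection.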

Another question is whether the processing part takes up a significant amount of time compared with reading the input. If it doesn't, parallel processing will not help much to speed things up. If it does, I suggest the following:

Get your data from the DB using a single query
Have multiple worker-threads that get the data-items from the result-set or an intermediate queue and process them. There's no need to have fixed batches, each worker just grabs the next available item once it finished processing the previous one.
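The two steps above could be sketched roughly as follows. This is not your actual code: the Iterator stands in for the cursor returned by a single find(whereClause) call, the Consumer stands in for your populate methods, and the class name, queue capacity and POISON sentinel are assumptions chosen for the example.

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class SingleReaderPipeline {
    // Sentinel telling a worker that no more items will arrive.
    private static final Object POISON = new Object();

    // Reads `source` on the calling thread and hands each item to one of
    // `workers` threads through a bounded queue; returns when all are processed.
    static <T> void run(Iterator<T> source, int workers, Consumer<T> processor) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1000);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        Object item = queue.take(); // grab the next available item
                        if (item == POISON) return;
                        @SuppressWarnings("unchecked") T t = (T) item;
                        try {
                            processor.accept(t);
                        } catch (RuntimeException e) {
                            e.printStackTrace(); // keep the worker alive on a bad item
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            // Single reader: put() blocks when the queue is full (back-pressure).
            while (source.hasNext()) queue.put(source.next());
            for (int i = 0; i < workers; i++) queue.put(POISON); // one per worker
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the DB cursor: the integers 1..100.
        Iterator<Integer> cursor = IntStream.rangeClosed(1, 100).iterator();
        AtomicLong sum = new AtomicLong();
        run(cursor, 2, x -> sum.addAndGet(x));
        System.out.println(sum.get()); // 5050
    }
}
```

The bounded queue keeps the reader from running far ahead of the workers, so memory use stays limited even with 40 million records.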

As for the number of threads: the "sweet spot" for minimum processing time depends on the kind of processing. For CPU-intensive tasks without much IO-processing it will most likely be around the number of available cores - in your case 2.
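If you would rather not hard-code the pool size, something like the following picks it up from the JVM. The class and method names are only for illustration, and this rule of thumb applies to CPU-bound work; for tasks that mostly wait on IO you would have to measure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // One thread per core visible to the JVM, and never fewer than one.
    static int cpuBoundPoolSize() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        int threads = cpuBoundPoolSize();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        System.out.println("Using " + threads + " worker threads");
        executor.shutdown();
    }
}
```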
