dimson

Reputation: 823

How can I improve the execution time of the last tasks on a standalone cluster?

I have this code:

JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/tmp11", 20).flatMap(
    new FlatMapFunction<Tuple2<String, String>, Document>() {
        @Override
        public Iterable<Document> call(Tuple2<String, String> tup) {
            // Normalize one (path, contents) pair into a single parsed Document.
            return Arrays.asList(DocParse.parse(parsingFunction(tup)));
        }
    }
);

Here I read text files from local storage (not a distributed filesystem) and normalize them; each file is roughly 100 KB to 1.5 MB. parsingFunction is plain Java: it doesn't call any Spark operations such as map or flatMap, and it doesn't distribute any data itself.

When I start the app on the standalone cluster, at first all CPUs on the worker machine are fully loaded (100%), and I see in the console:

14/12/09 18:30:41 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, fujitsu11., PROCESS_LOCAL, 1377 bytes)
14/12/09 18:30:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 12873 ms on fujitsu11. (1/12)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:42 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 14001 ms on fujitsu11. (2/12)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, fujitsu11., PROCESS_LOCAL, 1327 bytes)
14/12/09 18:30:44 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 15925 ms on fujitsu11. (3/12)
...

Later I see that the last tasks run much more slowly, with CPU load around 15%:

14/12/09 18:31:18 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 33373 ms on fujitsu11. (11/12)
14/12/09 18:32:38 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 104181 ms on fujitsu11. (12/12)

How can I increase the performance of this code?

My cluster consists of one master machine and one slave machine. Each machine has an 8-core CPU and 16 GB of RAM.

Upvotes: 3

Views: 2617

Answers (1)

Daniel Darabos

Reputation: 27455

You have 8 executor cores (the ones on the slave). The RDD probably has 20 partitions (from the wholeTextFiles call), so when the job starts, 20 tasks are created and the executor picks up 8 of them. Whenever a task finishes, a new one is picked up. Eventually fewer than 8 tasks remain, and executor threads start going idle; that is why you see CPU use gradually decrease before the job completes.
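If the workload has to stay on one worker, one mitigation is to split the job into more, smaller tasks so the final wave of tasks is shorter. Below is a minimal sketch of that idea, reusing DocParse and parsingFunction from the question; the partition counts (64 and 32) are illustrative guesses rather than tuned values, and the minPartitions argument to wholeTextFiles is only a hint to Spark.

    // Sketch: more, smaller partitions shrink the idle tail at the end of the stage.
    // wholeTextFiles yields one (path, contents) record per file, so one very large
    // file can still dominate a single task; repartition() spreads records evenly.
    JavaRDD<Document> termDocsRdd = sc
        .wholeTextFiles("D:/tmp11", 64)   // was 20; ~2-4 tasks per core is a common rule of thumb
        .repartition(32)                  // rebalance the (path, contents) records
        .flatMap(new FlatMapFunction<Tuple2<String, String>, Document>() {
            @Override
            public Iterable<Document> call(Tuple2<String, String> tup) {
                return Arrays.asList(DocParse.parse(parsingFunction(tup)));
            }
        });

Note that repartition shuffles the file contents across partitions, so whether it pays off depends on how skewed your file sizes are.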

Note that you are using a single machine for the actual work (the master is not performing any) while running a distributed computing system. That is fine for development, or when you don't care about performance. But if you do want performance, either use more than one machine or don't use Spark at all.
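To make the "don't use Spark" option concrete, here is a minimal single-machine sketch using a plain thread pool. It assumes hypothetical stand-ins for your DocParse.parse and parsingFunction helpers (here taking the file path and contents directly rather than a Tuple2) and skips error handling.

    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class LocalParse {
        public static void main(String[] args) throws Exception {
            // One thread per core; no cluster scheduler or serialization overhead.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<Document>> futures = new ArrayList<>();
            try (DirectoryStream<Path> dir =
                     Files.newDirectoryStream(Paths.get("D:/tmp11"))) {
                for (Path p : dir) {
                    // Each file becomes an independent unit of work, analogous to
                    // a Spark task. parsingFunction is assumed (hypothetically) to
                    // accept the path and contents directly.
                    futures.add(pool.submit(() -> {
                        String contents = new String(Files.readAllBytes(p));
                        return DocParse.parse(parsingFunction(p.toString(), contents));
                    }));
                }
            }
            List<Document> docs = new ArrayList<>();
            for (Future<Document> f : futures) {
                docs.add(f.get());
            }
            pool.shutdown();
        }
    }

The straggler effect exists here too, of course: one very large file will still keep a single thread busy after the others finish.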

Upvotes: 4
