Reputation: 607
I am running a Spark job. The worker has 4 cores and its memory is set to 5G. The application master is on another machine on the same network and does not host any workers. This is my code:
private void myClass() {
    // configuration of the spark context
    SparkConf conf = new SparkConf()
            .setAppName("myWork")
            .setMaster("spark://myHostIp:7077")
            .set("spark.driver.allowMultipleContexts", "true");
    // creation of the spark context in which we will run the algorithm
    JavaSparkContext sc = new JavaSparkContext(conf);
    // algorithm
    for (int i = 0; i < 200; i++) {
        System.out.println("===============================================================");
        System.out.println("iteration : " + i);
        System.out.println("===============================================================");
        ArrayList<Boolean> list = new ArrayList<Boolean>();
        for (int j = 0; j < 1900; j++) {
            list.add(true);
        }
        JavaRDD<myObj> ratings = sc.parallelize(list, 100)
                .map(bool -> new myObj())
                .map(obj -> this.setupObj(obj))
                .map(obj -> this.moveObj(obj))
                .cache();
        int[] stuff = ratings
                .map(obj -> obj.getStuff())
                .reduce((obj1, obj2) -> this.mergeStuff(obj1, obj2));
        this.setStuff(stuff);
        ArrayList<TabObj> tabObj = ratings
                .map(obj -> this.objToTabObjAsTab(obj))
                .reduce((obj1, obj2) -> this.mergeTabObj(obj1, obj2));
        ratings.unpersist(false);
        this.setTabObj(tabObj);
    }
    sc.close();
}
When I start it, I can see progress on the Spark UI, but it is really slow (I have to set the number of partitions in parallelize quite high, otherwise I get a timeout). I thought it was a CPU bottleneck, but the JVM CPU consumption is actually very low (most of the time it is 0%, sometimes a bit more than 5%...).
The JVM is using around 3G of memory according to the monitor, with only 19M cached.
The master host has 4 cores and less memory (4G). That machine shows 100% CPU consumption (one full core), and I don't understand why it is that high... It only has to send partitions to the worker on the other machine, right?
Why is CPU consumption low on the worker, and high on the master?
Upvotes: 5
Views: 5336
Reputation: 1214
Make sure you submit your Spark job via YARN or Mesos in the cluster; otherwise it may only run on your master node.
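For example, a cluster submission typically looks like the line below (the class name and jar are placeholders for your own application):

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myApp.jar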
Since your code is pretty simple, it should finish the computation very quickly, but I suggest using the word-count example to read a few GB of input sources and see what the CPU consumption looks like.
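A minimal word-count sketch for such a test might look like this (assuming the Spark 2.x Java API; the input and output paths are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount-test").setMaster("spark://myHostIp:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // read a few GB of text so the workers actually have work to do
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/big/input");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("hdfs:///path/to/output");
        sc.close();
    }
}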
Please use "local[*]" . * means use your All cores for computatation
SparkConf sparkConf = new SparkConf()
        .set("spark.driver.host", "localhost")
        .setAppName("unit-testing")
        .setMaster("local[*]");

References: https://spark.apache.org/docs/latest/configuration.html
In Spark, a lot of things can influence CPU and memory usage, such as the number of executors and how much spark.executor.memory you allocate to each.
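For example, on a standalone cluster you could set these explicitly (the values below are illustrative; tune them to your hardware):

SparkConf conf = new SparkConf()
        .setAppName("myWork")
        .setMaster("spark://myHostIp:7077")
        .set("spark.executor.memory", "4g")  // memory per executor
        .set("spark.executor.cores", "2");   // cores per executor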
Upvotes: 7