Reputation: 83
I've been experimenting with Apache Spark to see if it can be used to make an analysis engine for data we have stored in an Elasticsearch cluster. I've found that with any significant RDD size (i.e. several million records), even the simplest operations take more than a minute.
For example, I made this simple test program:
package es_spark;

import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class Main {
    public static void main(String[] pArgs) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        conf.set("es.nodes", pArgs[0]);
        JavaSparkContext sc = new JavaSparkContext(conf);

        long start = System.currentTimeMillis();

        // Load the whole "test3" index as an RDD of (document id, source map) pairs.
        JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "test3");
        long numES = esRDD.count();
        long loadStop = System.currentTimeMillis();

        // Trivial map that ignores each document and returns a constant.
        JavaRDD<Integer> dummyRDD = esRDD.map(pair -> 1);
        long numDummy = dummyRDD.count();
        long mapStop = System.currentTimeMillis();

        System.out.println("ES Count: " + numES);
        System.out.println("ES Partitions: " + esRDD.getNumPartitions());
        System.out.println("Dummy Count: " + numDummy);
        System.out.println("Dummy Partitions: " + dummyRDD.getNumPartitions());
        System.out.println("Data Load Took: " + (loadStop - start) + "ms");
        System.out.println("Dummy Map Took: " + (mapStop - loadStop) + "ms");

        sc.stop();
        sc.close();
    }
}
I've run this on a Spark cluster with 3 slaves, each with 14 cores and 49.0 GB of RAM, using the following command:
./bin/spark-submit --class es_spark.Main --master spark://<master_ip>:7077 ~/es_spark-0.0.1.jar <elasticsearch_main_ip>
The output is:
ES Count: 8140270
ES Partitions: 80
Dummy Count: 8140270
Dummy Partitions: 80
Data Load Took: 108059ms
Dummy Map Took: 104128ms
It takes 1.5+ minutes to perform the dummy map job on the 8+ million records. I find this performance surprisingly low given that the map job does nothing. Am I doing something wrong or is this about normal performance for Spark?
I've also tried twiddling the --executor-memory and --executor-cores settings without much difference.
Upvotes: 0
Views: 1107
Reputation: 35219
"I find this performance surprisingly low given that the map job does nothing."
The map job doesn't do nothing: it has to fetch the complete dataset from Elasticsearch. Because the data is not cached, this happens twice, once for each action. So overall you are measuring the cost of pulling the full dataset out of Elasticsearch for each count, plus some secondary things like initialization time.
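If you want the second count to reflect only the map itself, cache the RDD after loading it. A minimal sketch based on the question's own code (the storage level in the comment is just an example):

// Cache the RDD so the second action reuses the partitions already fetched
// from Elasticsearch instead of re-reading the whole index over the network.
JavaPairRDD<String, Map<String, Object>> esRDD =
        JavaEsSpark.esRDD(sc, "test3").cache();   // or .persist(StorageLevel.MEMORY_AND_DISK())

long numES = esRDD.count();                       // first action: reads from ES and fills the cache
long numDummy = esRDD.map(pair -> 1).count();     // second action: served from the cached partitions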
Upvotes: 1
Reputation: 6154
Generally unless you see OOM failures or significant GC or spilling to disk as a bottleneck, it's not worth changing the executor memory. When you do change it, you should also decrease the spark.memory.fraction. For your job it's highly unlikely to help.
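For completeness, both knobs are just spark-submit options; something along these lines, where the values are placeholders rather than recommendations:

./bin/spark-submit --class es_spark.Main --master spark://<master_ip>:7077 \
    --executor-memory 8g \
    --conf spark.memory.fraction=0.5 \
    ~/es_spark-0.0.1.jar <elasticsearch_main_ip>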
There is a startup cost to Spark that makes it relatively inefficient for smaller data loads. You should be able to optimize your startup to much shorter than a minute, but it's still much more practical for extremely large batch loads, not real-time analytics.
I would recommend you use the DataFrame API instead of RDDs. For your simple example operation above it shouldn't matter, but you're more likely to benefit from performance optimizations as things get more complicated.
e.g. sql.read.format("es").load("test3")
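In Java that would look roughly like the sketch below. It assumes Spark 2.x's SparkSession and the short "es" format name registered by the elasticsearch-spark connector; on older setups you'd go through SQLContext or the full org.elasticsearch.spark.sql source name instead.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Simple Application")
        .config("es.nodes", "<elasticsearch_main_ip>")
        .getOrCreate();

// Read the "test3" index as a DataFrame; with DataFrames the optimizer can
// push projections/filters down instead of shipping whole documents around.
Dataset<Row> esDF = spark.read().format("es").load("test3");
long numES = esDF.count();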
To troubleshoot what's causing the slowness you can take a look at your Spark UI. Were you actually getting parallelism? Did all the jobs execute in roughly the same amount of time? Another possible source of slowness is network issues between your cluster and the ES server.
Upvotes: 0