Mikel Urkia

Reputation: 2095

Spark Java error NoClassDefFoundError on Amazon Elastic MapReduce

I am trying to implement and run a Spark application on Amazon's Elastic MapReduce (EMR). So far I have been able to deploy and run a cluster with a "Spark Installation" bootstrap action using the following link:

s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh

The script can also be accessed at this address.

In order to upload the .jar application to the cluster, I have created a step configuration as follows:

 HadoopJarStepConfig customConfig = new HadoopJarStepConfig()
                 .withJar("s3://mybucket/SparkApp.jar")
                 .withMainClass("SparkApp"); // class name, not the .java file name

 StepConfig customJarStep = new StepConfig()
                 .withName("Run custom jar")                                                                                    
                 .withActionOnFailure(ActionOnFailure.CONTINUE)
                 .withHadoopJarStep(customConfig);

Finally, the following code shows the actual Spark application, extracted from the word count example provided by the Spark team (for version 0.8.1). As you may notice, the code imports several Spark libraries in order to run. The libraries are:

spark-core_2.9.3-0.8.1-incubating.jar and scala-library-2.9.3.jar

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;

public class SparkApp {
  public static void main(String[] args) throws Exception {

    JavaSparkContext ctx = new JavaSparkContext("local", "JavaWordCount",
        System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
    JavaRDD<String> lines = ctx.textFile("s3://murquiabucket/wordcount.txt", 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

    JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1 + ": " + tuple._2);
    }
    System.exit(0);
  }
}
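For reference, the flatMap/map/reduceByKey pipeline above computes an ordinary word count. A plain-Java sketch of the same aggregation (no Spark involved; the class and method names are my own) makes the intended result easier to see:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Single-threaded equivalent of the Spark pipeline:
    // flatMap(split on spaces) -> map(word -> (word, 1)) -> reduceByKey(sum).
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {     // flatMap step
                counts.merge(word, 1, Integer::sum);  // map + reduceByKey steps
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("a b a", "b c"));
        System.out.println(counts); // counts: a -> 2, b -> 2, c -> 1
    }
}
```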

The problem comes when I try to execute the jar (I made a fat jar to embed the necessary libraries) in the EMR cluster. The application terminates unsuccessfully due to the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mesos/Scheduler at java.lang.ClassLoader.defineClass1(Native Method) ...

From what I understand, there is an issue with Mesos that I am not able to figure out. In case it helps, this is the information of the EMR cluster:

Upvotes: 1

Views: 1772

Answers (1)

Mikel Urkia

Reputation: 2095

As @samthebest pointed out in the comments above, the error was in fact caused by a mismatch between the Spark version installed on EMR and the Spark version my application was compiled against.

What I learned from this error is how important it is to verify that every library an application is built and bundled with matches the version running on the cluster.
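One common way to avoid this kind of conflict, assuming a Maven build (a sketch; the coordinates must match whatever the cluster actually runs, here the 0.8.1-incubating release installed by the bootstrap action), is to declare the Spark dependency with `provided` scope, so the fat jar compiles against Spark but does not bundle a second, possibly conflicting copy:

```xml
<!-- Hypothetical pom.xml fragment: the version must match the Spark
     installed on the cluster (0.8.1-incubating in this case). -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.9.3</artifactId>
    <version>0.8.1-incubating</version>
    <!-- provided: compile against it, let the cluster supply it at runtime -->
    <scope>provided</scope>
</dependency>
```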

Upvotes: 1
