Osama Abdulsattar

Reputation: 576

How to resolve Spark library conflict with Cloudera CDH 5.8.0 virtual box

I am trying to submit a job to Spark on the Cloudera CDH 5.8.0 VirtualBox image. I use the org.json library, and I use the maven-shade plugin to bundle the dependency into my jar file. The following is my pom:

<project>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>1.5.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20160810</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <finalName>uber-${project.artifactId}-${project.version}</finalName>
                </configuration>
            </plugin>
        </plugins>
    </build>


</project>

Submit command is:

spark-submit --class com.example.spark.SparkParser --master local[*] uber-spark-0.0.1-SNAPSHOT.jar 

And I keep getting following exception:

Exception in thread "main" java.lang.NoSuchMethodError:
org.json.JSONTokener.<init>(Ljava/io/InputStream;)

I found the following small snippet that can tell which jar a class was loaded from:

import java.net.URL; // import needed for the snippet below

ClassLoader classloader = org.json.JSONTokener.class.getClassLoader();
URL res = classloader.getResource("org/json/JSONTokener.class");
String path = res.getPath();
System.out.println("Core JSONTokener came from " + path);

And the output is the following:

Core JSONTokener came from file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.8.0.jar!/org/json/JSONTokener.class

I can find the file locally in the CDH virtual box:

[cloudera@quickstart ~]$ ls -l /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.8.0.jar
-rw-r--r-- 1 root root 19306194 Jun 16  2016 /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.8.0.jar

I even tried marking the json library as 'provided' to exclude it from my jar file, but I still get the same error.

I tried removing the local jar file /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.8.0.jar, and my code works correctly, but I am not sure this is the right solution, or whether removing this library would break Cloudera somehow.

So, how can I tell Spark not to use this local jar file, and to use the one included inside my 'uber-spark-0.0.1-SNAPSHOT.jar' file?

Upvotes: 0

Views: 775

Answers (1)

LoopBit

Reputation: 46

Not sure why no one has answered you before...

Your issue is that you have two different versions of the same library on the runtime classpath: one included in your jar, and another added by Cloudera. The JSONTokener constructor differs between the two versions (perhaps it doesn't exist in one of them, or its signature changed). Your code compiles against one version, but at runtime the ClassLoader resolves the class from the other, so the method is missing and you get NoSuchMethodError.

The short answer to your question is that you can't, at least not directly: the Java ClassLoader searches the classpath in order and, when you load a class, uses the first match it finds. In this case, that is the copy provided by the Hive runtime.
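One workaround that is commonly used for exactly this situation (not mentioned in the original answer, so treat it as a suggestion to verify) is class relocation with the maven-shade-plugin the question already uses: shading can rewrite the org.json packages inside the uber jar into a private namespace, so they no longer collide with the copy bundled in hive-exec. A sketch, added to the existing shade plugin `<configuration>` (the shaded package name `com.example.shaded.json` is arbitrary):

```xml
<configuration>
    <!-- existing <filters> and <finalName> stay as they are -->
    <relocations>
        <relocation>
            <!-- move org.json.* to a private package inside the uber jar -->
            <pattern>org.json</pattern>
            <shadedPattern>com.example.shaded.json</shadedPattern>
        </relocation>
    </relocations>
</configuration>
```

The shade plugin rewrites the bytecode references in your own classes as well, so your code keeps compiling against `org.json` but runs against the relocated copy.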

Longer answer: your only option to force the use of the jar bundled with your app is to edit the Spark defaults so that Hive's jars are not added to the classpath. I'm not entirely sure how to do that in your case, but I would start with /etc/spark/spark-defaults.conf and try to disable Hive there, or look for the equivalent setting in Cloudera Manager.
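Spark 1.5.x also has experimental settings that ask Spark to prefer classes from the user jar over the system classpath. They are not covered in this answer and can behave inconsistently on some distributions, so treat this as an experiment rather than a guaranteed fix:

```shell
# Experimental Spark 1.5.x options: resolve classes from the user jar first.
spark-submit \
  --class com.example.spark.SparkParser \
  --master 'local[*]' \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  uber-spark-0.0.1-SNAPSHOT.jar
```

The same two properties can instead be set permanently in /etc/spark/spark-defaults.conf.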

A better option is to remove the json jar from your project, add the Cloudera Maven repository to your pom, and include hive-exec-1.1.0-cdh5.8.0 as a provided dependency; see "Using the CDH 5 Maven Repository" in Cloudera's documentation for more details on how to do this.
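Concretely, that suggestion would look roughly like the following pom additions (coordinates follow Cloudera's CDH 5 Maven repository documentation; verify the exact version against your cluster):

```xml
<!-- Cloudera's public Maven repository, in <repositories> -->
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

<!-- hive-exec as provided, in <dependencies>, replacing org.json:json -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.1.0-cdh5.8.0</version>
    <scope>provided</scope>
</dependency>
```

This means compiling against the same (older) org.json classes that Hive ships, so the code would likely need adjusting to the older JSONTokener API, for example passing a Reader rather than an InputStream.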

Hope this helps.

Upvotes: 1
