Nira

Reputation: 469

Dataproc dependency conflict - google-api-client

I'm building a library for fetching encrypted secrets from cloud storage (in Scala, using the Java clients). I'm using the following Google libraries:

"com.google.apis"  % "google-api-services-cloudkms" % "v1-rev26-1.23.0" exclude("com.google.guava", "guava-jdk5"),
"com.google.cloud" % "google-cloud-storage"         % "1.14.0",

Everything works fine locally, but when I try to run my code in Dataproc I'm getting the following error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder;
    at com.google.api.services.cloudkms.v1.CloudKMS$Builder.setBatchPath(CloudKMS.java:4250)
    at com.google.api.services.cloudkms.v1.CloudKMS$Builder.<init>(CloudKMS.java:4229)
    at gcp.encryption.EncryptedSecretsUser$class.clients(EncryptedSecretsUser.scala:111)
    at gcp.encryption.EncryptedSecretsUser$class.getEncryptedSecrets(EncryptedSecretsUser.scala:62)

The offending line in my code is:

val kms: CloudKMS = new CloudKMS.Builder(credential.getTransport,
      credential.getJsonFactory,
      credential)
      .setApplicationName("Encrypted Secrets User")
      .build()

According to the documentation, some Google libraries are already available on Dataproc (I'm using a Spark cluster with image version 1.2.15). But as far as I can see, the google-api-client version pulled in transitively there is the same one I'm using locally (1.23.0). So why isn't the method found?

Should I set up my dependencies differently for running on Dataproc?

EDIT

I finally managed to solve this in another project. It turns out that besides shading all the Google dependencies (including the gcs-connector!), you also have to register the shaded GoogleHadoopFileSystem class as the Hadoop FileSystem provider for the gs:// scheme. Below is the Maven configuration that works for me; something similar can be achieved with sbt:

Parent POM:

<project xmlns="http://maven.apache.org/POM/4.0.0"...>
...
<properties>
    <!-- Spark version -->
    <spark.version>[2.2.1]</spark.version>
    <!-- Jackson-libs version pulled in by spark -->
    <jackson.version>[2.6.5]</jackson.version>
    <!-- Avro version pulled in by jackson -->
    <avro.version>[1.7.7]</avro.version>
    <!-- Kryo-shaded version pulled in by spark -->
    <kryo.version>[3.0.3]</kryo.version>
    <!-- Apache commons-lang version pulled in by spark -->
    <commons.lang.version>2.6</commons.lang.version>

    <!-- TODO: need to shade google libs because of version-conflicts on Dataproc. Remove this when Dataproc 1.3/2.0 is released -->
    <bigquery-conn.version>[0.10.6-hadoop2]</bigquery-conn.version>
    <gcs-conn.version>[1.6.5-hadoop2]</gcs-conn.version>
    <google-storage.version>[1.29.0]</google-storage.version>
    <!-- The guava version we want to use -->
    <guava.version>[23.2-jre]</guava.version>
    <!-- The google api version used by the google-cloud-storage lib -->
    <api-client.version>[1.23.0]</api-client.version>
    <!-- The google-api-services-storage version used by the google-cloud-storage lib -->
    <storage-api.version>[v1-rev114-1.23.0]</storage-api.version>

    <!-- Picked up by compiler and resource plugins -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

...

<build>
    <pluginManagement>
        <plugins>
...

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                        <filters>
                            <filter>
                                <artifact>com.google.**:*</artifact>
                                <includes>
                                    <include>**</include>
                                </includes>
                            </filter>
                            <filter>
                                <artifact>com.google.cloud.bigdataoss:gcs-connector</artifact>
                                <excludes>
                                    <!-- Register a provider with the shaded name instead-->
                                    <exclude>META-INF/services/org.apache.hadoop.fs.FileSystem</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <artifactSet>
                            <includes>
                                <include>com.google.*:*</include>
                            </includes>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                            </excludes>
                        </artifactSet>
                        <relocations>
                            <relocation>
                                <pattern>com.google</pattern>
                                <shadedPattern>com.shaded.google</shadedPattern>
                            </relocation>
                        </relocations>
                    </configuration>
                </execution>
            </executions>
        </plugin>
...
        </plugins>
    </pluginManagement>
</build>

<dependencyManagement>
    <dependencies>
        <dependency>
...
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>gcs-connector</artifactId>
            <version>${gcs-conn.version}</version>
            <exclusions>
                <!-- conflicts with Spark dependencies -->
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-common</artifactId>
                </exclusion>
                <!-- conflicts with Spark dependencies -->
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-mapreduce-client-core</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <!-- Avoid conflict with the version pulled in by the GCS-connector on Dataproc -->
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-storage</artifactId>
            <version>${storage-api.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>${commons.lang.version}</version>
        </dependency>
        <dependency>
            <groupId>com.esotericsoftware</groupId>
            <artifactId>kryo-shaded</artifactId>
            <version>${kryo.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.api-client</groupId>
            <artifactId>google-api-client</artifactId>
            <version>${api-client.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>${guava.version}</version>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-storage</artifactId>
        <version>${google-storage.version}</version>
        <exclusions>
            <!-- conflicts with Spark dependencies -->
            <exclusion>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
    </dependency>
...
</dependencies>

...
</project>

Child POM:

<dependencies>
    <!-- Libraries available on Dataproc -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcs-connector</artifactId>
    </dependency>
    <dependency>
        <groupId>com.esotericsoftware</groupId>
        <artifactId>kryo-shaded</artifactId>
        <scope>provided</scope><!-- Pulled in by spark -->
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <scope>provided</scope><!-- Pulled in by spark -->
    </dependency>
</dependencies>

Then add a file named org.apache.hadoop.fs.FileSystem under path/to/your-project/src/main/resources/META-INF/services, containing the fully qualified name of your shaded class, e.g.:

# WORKAROUND FOR DEPENDENCY CONFLICTS ON DATAPROC
#
# Use the shaded class as a provider for the gs:// file system
#

com.shaded.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

(Note that the original copy of this file is filtered out of the gcs-connector artifact in the parent POM, so only the shaded provider gets registered.)
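
To check which provider files actually end up on the job's classpath, something like the following could be run from the driver. It is only a minimal sketch using the JDK and the Scala standard library; ServiceFileProbe and its method names are made up for illustration and are not part of Hadoop or the gcs-connector.

```scala
import java.net.URL
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical helper: lists every META-INF/services file for a given
// service interface that is visible to the context class loader.
object ServiceFileProbe {

  /** Provider class names in a service file, ignoring comments and blank lines. */
  def parseProviders(lines: Iterator[String]): List[String] =
    lines.map(_.takeWhile(_ != '#').trim).filter(_.nonEmpty).toList

  /** Each service file found on the classpath, paired with the providers it names. */
  def providersOnClasspath(service: String): List[(URL, List[String])] = {
    val loader = Thread.currentThread.getContextClassLoader
    val urls = loader.getResources(s"META-INF/services/$service").asScala.toList
    urls.map { url =>
      val src = Source.fromURL(url, "UTF-8")
      try url -> parseProviders(src.getLines()) finally src.close()
    }
  }
}
```

Calling ServiceFileProbe.providersOnClasspath("org.apache.hadoop.fs.FileSystem") should then show whether the shaded jar names com.shaded.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem and which other jars register providers of their own.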

Upvotes: 1

Views: 2946

Answers (1)

Dennis Huo

Reputation: 10677

It may not be obvious, but the google-api-client version in the latest stable GCS connector is actually 1.20.0.

The reason is that the commit which rolled the api client version forward to 1.23.0 was part of a series of commits, including a dependency-shading commit, whose overall goal was to stop leaking the transitive dependency into the job classpath at all. That avoids version collisions in the future, at the cost of everyone having to bring their own fat jar containing the full api client dependencies.

However, it turns out that many people have already come to depend on the api client provided by the GCS connector being on the classpath, so there are production workloads which cannot survive such a change within a minor version upgrade. The upgraded GCS connector, which uses 1.23.0 but shades it so that it no longer appears in the job classpath, is therefore reserved for a future Dataproc 1.3+ or 2.0+ release.

In your case, you could try using 1.20.0 versions of your dependencies. You may also have to downgrade the google-cloud-storage dependency you included, though a 1.22.0 version may still work, assuming no breaking changes, since it would not reference setBatchPath, which was introduced only in 1.23.0. Otherwise, you can try to shade all your own dependencies using sbt-assembly.
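
For the sbt-assembly route, the relocation might look roughly like the fragment below. Treat it as a sketch rather than a tested Dataproc setup: it assumes the sbt-assembly plugin is already added in project/plugins.sbt, and the com.shaded.google prefix is just an example name. Older plugin versions spell the key as assemblyShadeRules in assembly.

```scala
// build.sbt
assembly / assemblyShadeRules := Seq(
  // Move every class under com.google into com.shaded.google, both in the
  // project's own classes and in all dependency jars bundled into the fat jar.
  ShadeRule.rename("com.google.**" -> "com.shaded.google.@1").inAll
)
```

Spark itself would normally stay out of the fat jar (marked "provided"), since the cluster supplies it.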

We can verify that setBatchPath was introduced only in 1.23.0:

$ javap -cp google-api-client-1.22.0.jar com.google.api.client.googleapis.services.AbstractGoogleClient.Builder | grep set
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setRootUrl(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setServicePath(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setGoogleClientRequestInitializer(com.google.api.client.googleapis.services.GoogleClientRequestInitializer);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setHttpRequestInitializer(com.google.api.client.http.HttpRequestInitializer);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setApplicationName(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressPatternChecks(boolean);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressRequiredParameterChecks(boolean);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressAllChecks(boolean);

$ javap -cp google-api-client-1.23.0.jar com.google.api.client.googleapis.services.AbstractGoogleClient.Builder | grep set
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setRootUrl(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setServicePath(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setBatchPath(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setGoogleClientRequestInitializer(com.google.api.client.googleapis.services.GoogleClientRequestInitializer);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setHttpRequestInitializer(com.google.api.client.http.HttpRequestInitializer);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setApplicationName(java.lang.String);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressPatternChecks(boolean);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressRequiredParameterChecks(boolean);
  public com.google.api.client.googleapis.services.AbstractGoogleClient$Builder setSuppressAllChecks(boolean);
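
The same check can also be made from inside the running job, which shows what the cluster's classpath actually provides rather than what the build declares. Below is a minimal sketch using plain reflection; ClasspathProbe is a hypothetical helper, not part of any Google library.

```scala
// Hypothetical helper for inspecting what the runtime classpath resolved.
object ClasspathProbe {

  /** Jar or directory a class was loaded from, if the JVM exposes it. */
  def sourceOf(className: String): Option[String] = {
    val codeSource = Class.forName(className).getProtectionDomain.getCodeSource
    Option(codeSource).flatMap(cs => Option(cs.getLocation)).map(_.toString)
  }

  /** True if the class exposes a public method with the given name. */
  def hasPublicMethod(className: String, methodName: String): Boolean =
    Class.forName(className).getMethods.exists(_.getName == methodName)
}
```

Passing "com.google.api.client.googleapis.services.AbstractGoogleClient$Builder" to sourceOf should reveal which jar wins on the cluster, and hasPublicMethod with "setBatchPath" tells you whether that jar is new enough. Both calls throw ClassNotFoundException if the class is not on the classpath at all.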

Upvotes: 5
