Galuoises

Reputation: 3283

Cloud Storage Client with Scala and Dataproc: missing libraries

I am trying to run a simple Spark script on a Dataproc cluster that needs to read from and write to a GCS bucket, using Scala and the Java Cloud Storage client libraries. The script is the following:

// build.sbt
name := "trialGCS"

version := "0.0.1"
scalaVersion := "2.12.10"

val sparkVersion = "3.0.1"

libraryDependencies ++= Seq(
  // Spark core libraries
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,

  // Google Cloud Storage Java client
  "com.google.cloud" % "google-cloud-storage" % "1.113.15"
)
resolvers += Resolver.mavenLocal

// TrialGCS.scala
package DEV

import com.google.cloud.storage.StorageOptions
import org.apache.spark.sql.SparkSession

object TrialGCS extends App {
  val spark = SparkSession.builder().appName("trialGCS").getOrCreate()
  import spark.implicits._

  // Cloud Storage client built from the default application credentials
  val storage = StorageOptions.getDefaultInstance.getService
}
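
(For context, the read/write part of the script boils down to something along the lines of the sketch below; the bucket and object names are just placeholders, not the real ones.)

// sketch only: write and read a small text object with the storage client above
import com.google.cloud.storage.{BlobId, BlobInfo}
import java.nio.charset.StandardCharsets

val blobId = BlobId.of("my-bucket", "path/to/object.txt") // placeholder names
storage.create(BlobInfo.newBuilder(blobId).build(),
  "hello from dataproc".getBytes(StandardCharsets.UTF_8))
val content = new String(storage.readAllBytes(blobId), StandardCharsets.UTF_8)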

I launch the script from the terminal with the following shell command:

gcloud dataproc jobs submit spark --class DEV.TrialGCS --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar --cluster <CLUSTERNAME> --region=<REGIONNAME>

However, this produces the error java.lang.NoClassDefFoundError: com/google/cloud/storage/Storage.

If I include the google-cloud-storage jar manually, replacing --jars in the previous command with

--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar,google-cloud-storage-1.113.15.jar

the error is now java.lang.NoClassDefFoundError: com/google/cloud/Service.

So, apparently, it's a matter of missing libraries: the client jar's transitive dependencies are not on the classpath either.

On the other hand, if I run spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15" via SSH on the Dataproc driver's VM, everything works perfectly.
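
The working interactive check looks roughly like this (any SSH access to the master node will do; the session below is only a sketch):

# on the Dataproc master node, after SSH-ing in
spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15"

scala> import com.google.cloud.storage.StorageOptions
scala> val storage = StorageOptions.getDefaultInstance.getService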

How to solve this issue?

Upvotes: 2

Views: 708

Answers (2)

Galuoises

Reputation: 3283

I've found the solution: to properly manage the package dependency, the google-cloud-storage library needs to be included via --properties=spark.jars.packages=<MAVEN_COORDINATES>, as shown in https://cloud.google.com/dataproc/docs/guides/manage-spark-dependencies. With this property Spark resolves the coordinate, together with its transitive dependencies, from Maven Central when the job starts. In my case this means:

gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=spark.jars.packages="com.google.cloud:google-cloud-storage:1.113.15"

When multiple Maven coordinates for packages, or multiple properties, are necessary, the string has to be escaped (see https://cloud.google.com/sdk/gcloud/reference/topic/escaping): the leading ^#^ tells gcloud to use # instead of , as the separator between properties, so the commas inside the spark.jars.packages value are preserved.

For instance, for google-cloud-storage and kafka:

--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,io#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar
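
As a sketch, the escaped flag slots into the full submit command from above like this (keeping just the two package coordinates for readability; the jar, cluster and region placeholders are as before):

gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar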

Upvotes: 1

Md Shihab Uddin

Reputation: 561

If you are sure the dependent jars are present on the driver machine, you can add them to the classpath explicitly. You can try the following command:

gcloud dataproc jobs submit spark  \
--class DEV.TrialGCS \
--properties spark.driver.extraClassPath=<comma separated full path of jars>,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
--cluster <CLUSTERNAME> --region=<REGIONNAME>

Upvotes: 1
