Galuoises

Reputation: 3283

Cloud Storage Client with Scala and Dataproc: missing libraries

I am trying to run a simple Spark script on a Dataproc cluster that needs to read from and write to a GCS bucket, using Scala and the Java Cloud Storage client libraries. The script is the following:

// build.sbt
name := "trialGCS"

version := "0.0.1"
scalaVersion := "2.12.10"

val sparkVersion = "3.0.1"

libraryDependencies ++= Seq(
  // Spark core libraries
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,

  // Google Cloud Storage Java client
  "com.google.cloud" % "google-cloud-storage" % "1.113.15"
)
resolvers += Resolver.mavenLocal

// TrialGCS.scala
package DEV

import com.google.cloud.storage.StorageOptions
import org.apache.spark.sql.SparkSession

object TrialGCS extends App {
  val spark = SparkSession.builder().appName("trialGCS").getOrCreate()
  import spark.implicits._

  // Cloud Storage client built from the default application credentials
  val storage = StorageOptions.getDefaultInstance.getService
}
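
(For context, the read/write part of the script boils down to something along the lines of the sketch below; the bucket and object names are just placeholders, not the real ones.)

// sketch only: write and read a small text object with the storage client above
import com.google.cloud.storage.{BlobId, BlobInfo}
import java.nio.charset.StandardCharsets

val blobId = BlobId.of("my-bucket", "path/to/object.txt") // placeholder names
storage.create(BlobInfo.newBuilder(blobId).build(),
  "hello from dataproc".getBytes(StandardCharsets.UTF_8))
val content = new String(storage.readAllBytes(blobId), StandardCharsets.UTF_8)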

I launch the script from the terminal with the following shell command:

gcloud dataproc jobs submit spark --class DEV.TrialGCS --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar --cluster <CLUSTERNAME> --region=<REGIONNAME>

However, this produces the error java.lang.NoClassDefFoundError: com/google/cloud/storage/Storage.

If I include the google-cloud-storage jar manually, replacing --jars in the previous command with

--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar,google-cloud-storage-1.113.15.jar

the error is now java.lang.NoClassDefFoundError: com/google/cloud/Service.

So, apparently, it's a matter of missing libraries: the client jar's transitive dependencies are not on the classpath either.

On the other hand, if I run spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15" via SSH on the Dataproc driver's VM, everything works perfectly.
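
The working interactive check looks roughly like this (any SSH access to the master node will do; the session below is only a sketch):

# on the Dataproc master node, after SSH-ing in
spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15"

scala> import com.google.cloud.storage.StorageOptions
scala> val storage = StorageOptions.getDefaultInstance.getService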

How to solve this issue?

Upvotes: 2

Views: 708

Answers (2)

Galuoises

Reputation: 3283

I've found the solution: to properly manage the package dependency, the google-cloud-storage library needs to be included via --properties=spark.jars.packages=<MAVEN_COORDINATES>, as shown in https://cloud.google.com/dataproc/docs/guides/manage-spark-dependencies. With this property Spark resolves the coordinate, together with its transitive dependencies, from Maven Central when the job starts. In my case this means:

gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=spark.jars.packages="com.google.cloud:google-cloud-storage:1.113.15"

When multiple Maven coordinates for packages, or multiple properties, are necessary, the string has to be escaped (see https://cloud.google.com/sdk/gcloud/reference/topic/escaping): the leading ^#^ tells gcloud to use # instead of , as the separator between properties, so the commas inside the spark.jars.packages value are preserved.

For instance, for google-cloud-storage and kafka:

--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,io#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar
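
As a sketch, the escaped flag slots into the full submit command from above like this (keeping just the two package coordinates for readability; the jar, cluster and region placeholders are as before):

gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar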

Upvotes: 1

Md Shihab Uddin

Reputation: 561

If you are sure the dependent jars are present on the driver machine, you can add them to the classpath explicitly. You can try the following command:

gcloud dataproc jobs submit spark  \
--class DEV.TrialGCS \
--properties spark.driver.extraClassPath=<comma separated full path of jars>,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
--cluster <CLUSTERNAME> --region=<REGIONNAME>

Upvotes: 1
