Reputation: 3283
I am trying to run a simple Spark script on a Dataproc cluster that needs to read/write to a GCS bucket, using Scala and the Java Cloud Storage client libraries. The script is the following:
//Build.sbt
name := "trialGCS"
version := "0.0.1"
scalaVersion := "2.12.10"
val sparkVersion = "3.0.1"

libraryDependencies ++= Seq(
  // Spark core libraries
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.google.cloud" % "google-cloud-storage" % "1.113.15"
)

resolvers += Resolver.mavenLocal
//TrialGCS.scala
package DEV
import com.google.cloud.storage.StorageOptions
import org.apache.spark.sql.SparkSession

object TrialGCS extends App {
  // A SparkSession must exist before spark.implicits._ can be imported
  val spark = SparkSession.builder.appName("trialGCS").getOrCreate()
  import spark.implicits._
  val storage = StorageOptions.getDefaultInstance.getService
}
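For context, here is a minimal sketch of the kind of read/write the script is meant to perform with the client; the bucket and object names are hypothetical placeholders:

import com.google.cloud.storage.{BlobId, BlobInfo}
import java.nio.charset.StandardCharsets

// Hypothetical bucket/object, for illustration only
val blobId = BlobId.of("my-bucket", "path/to/object.txt")
// Write: create (or overwrite) the object
storage.create(BlobInfo.newBuilder(blobId).build(),
  "hello from dataproc".getBytes(StandardCharsets.UTF_8))
// Read the object's bytes back as a string
val content = new String(storage.readAllBytes(blobId), StandardCharsets.UTF_8)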
I launch the script from the terminal with the shell command:
gcloud dataproc jobs submit spark --class DEV.TrialGCS --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar --cluster <CLUSTERNAME> --region=<REGIONNAME>
However, this produces the error java.lang.NoClassDefFoundError: com/google/cloud/storage/Storage.
If I include the google-cloud-storage jar manually, changing --jars in the previous command to
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar,google-cloud-storage-1.113.15.jar
the error becomes java.lang.NoClassDefFoundError: com/google/cloud/Service.
So, apparently, it's a matter of missing libraries: the manually added jar does not bring along its own transitive dependencies (com.google.cloud.Service comes from google-cloud-core).
On the other hand, if I run spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15" via SSH on the Dataproc driver VM, everything works perfectly.
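For instance, this quick check succeeds there (the same storage-client call as in the script):

spark-shell --packages "com.google.cloud:google-cloud-storage:1.113.15"
// then, inside the shell:
val storage = com.google.cloud.storage.StorageOptions.getDefaultInstance.getService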
How can I solve this issue?
Upvotes: 2
Views: 708
Reputation: 3283
I've found the solution: to manage the package dependency properly, the google-cloud-storage library needs to be included via --properties=spark.jars.packages=<MAVEN_COORDINATES>, which resolves the coordinates together with their transitive dependencies (unlike passing a single jar via --jars), as shown in https://cloud.google.com/dataproc/docs/guides/manage-spark-dependencies. In my case this means
gcloud dataproc jobs submit spark --class DEV.TrialGCS \
--jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
--cluster <CLUSTERNAME> --region=<REGIONNAME> \
--properties=spark.jars.packages="com.google.cloud:google-cloud-storage:1.113.15"
When multiple Maven coordinates or multiple properties are needed, the string must be escaped: https://cloud.google.com/sdk/gcloud/reference/topic/escaping
For instance, for google-cloud-storage and kafka:
--properties=^#^spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1#spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar#spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar
Upvotes: 1
Reputation: 561
If you've ensured that the dependent jars are present on the driver machine, you can add them to the classpath explicitly. You can try the following command:
gcloud dataproc jobs submit spark \
--class DEV.TrialGCS \
--properties spark.driver.extraClassPath=<colon-separated full paths of jars>,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
--cluster <CLUSTERNAME> --region=<REGIONNAME>
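For instance, a hypothetical concrete invocation, assuming the dependency jar has been copied to /opt/jars/ on the driver (the path is illustrative only):

gcloud dataproc jobs submit spark \
  --class DEV.TrialGCS \
  --jars target/scala-2.12/trialgcs_2.12-0.0.1.jar \
  --properties spark.driver.extraClassPath=/opt/jars/google-cloud-storage-1.113.15.jar,spark.jars.packages=com.google.cloud:google-cloud-storage:1.113.15 \
  --cluster <CLUSTERNAME> --region=<REGIONNAME>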
Upvotes: 1