Reputation: 5771
I am a Spark amateur, as you will notice from the question. I am trying to run very basic code on a Spark cluster (created on Dataproc).
Creating a pyspark shell with pyspark --master yarn and running the code there - Success
Running the exact same code with spark-submit --master yarn code.py - Fails
I have provided some basic details below. Please let me know what additional details I can provide to help you help me.
Details:
Code to be run:
testing_dep.py
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and get its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Distribute the numbers 0-99 across 10 partitions and collect them back to the driver
rdd = sc.parallelize(range(100), numSlices=10).collect()
print(rdd)
Running with pyspark shell
pyspark --master yarn
output:
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-01-07 20:45:54,608 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-01-07 20:45:58,195 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2022-01-07 20:45:58,357 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Python version 3.9.7 (default, Sep 29 2021 19:20:46)
Spark context Web UI available at http://pyspark32-m.us-central1-b.c.monsoon-credittech.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1641410203571_0040).
SparkSession available as 'spark'.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100),numSlices=10).collect()
>>> print(rdd)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
Running with spark-submit
spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py
output:
2022-01-07 20:48:37,310 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1938)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:780)
at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
... 19 more
Upvotes: 2
Views: 938
Reputation: 542
I think the error message is clear:
Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
You need to add the JAR file that contains the above class to the Spark classpath (for example via SPARK_CLASSPATH), so that spark-submit can resolve the gs:// path at submit time.
Please see Issues Google Cloud Storage connector on Spark or DataProc for complete solutions.
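A minimal sketch of how that could look on a Dataproc node, assuming the GCS connector JAR shipped with the image is at /usr/lib/hadoop/lib/gcs-connector.jar (the actual path and file name vary by image and connector version, so check your cluster):

# Hypothetical connector location - verify it on your Dataproc image
export SPARK_CLASSPATH=/usr/lib/hadoop/lib/gcs-connector.jar
spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py

Passing the same JAR with --driver-class-path should have a similar effect in client mode, and copying testing_dep.py to the local filesystem first would avoid the need to resolve the gs:// path during submission at all.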
Upvotes: 1