Reputation: 1645
Below are the runtime versions in PyCharm:
Java Home /Library/Java/JavaVirtualMachines/jdk-11.0.16.1.jdk/Contents/Home
Java Version 11.0.16.1 (Oracle Corporation)
Scala Version 2.12.15
Spark Version spark-3.3.1
Python 3.9
I am trying to write a PySpark DataFrame to CSV as below:
df.write.csv("/Users/data/data.csv")
and get the error:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/readwriter.py", line 1240, in csv
self._jwrite.csv(path)
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o747.csv.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
And the Spark conf is as below:
from pyspark import SparkConf
import os

spark_conf = SparkConf()
spark_conf.setAll(parameters.items())
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
spark_conf.set('spark.hadoop.fs.s3.aws.credentials.provider',
               'org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider')
spark_conf.set('spark.hadoop.fs.s3.access.key', os.environ.get('AWS_ACCESS_KEY_ID'))
spark_conf.set('spark.hadoop.fs.s3.secret.key', os.environ.get('AWS_SECRET_ACCESS_KEY'))
spark_conf.set('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true')
spark_conf.set("com.amazonaws.services.s3.enableV4", "true")
spark_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_conf.set("fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark_conf.set("hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
spark_conf.set("fs.s3a.path.style.access", "true")
spark_conf.set("fs.s3a.multipart.size", "128M")
spark_conf.set("fs.s3a.fast.upload.active.blocks", "4")
spark_conf.set("fs.s3a.committer.name", "partitioned")
spark_conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
spark_conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
spark_conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
spark_conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
Any help to fix this issue is appreciated. Thanks!!
Upvotes: 3
Views: 3153
Reputation: 599
You have to add the required jar to the Spark job's packages. Downloading it locally won't change anything; the jar has to be on the job's classpath. To do that, change your code from:
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
to
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')
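For completeness, here is a minimal sketch of how that change might look when building the session. Only the spark.jars.packages line is taken from the question; the extra committer setting, the sample DataFrame and the output path are illustrative:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
# The key change: spark-hadoop-cloud provides PathOutputCommitProtocol.
spark_conf.set('spark.jars.packages',
               'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')
spark_conf.set('spark.sql.sources.commitProtocolClass',
               'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')

# Packages are resolved (from Maven Central by default) when the context starts,
# so spark.jars.packages must be set before the SparkSession is created.
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
spark.range(10).write.csv('/tmp/out.csv')  # illustrative write

Note that if a SparkSession is already running (e.g. in the PyCharm console), the new packages setting won't be picked up until the session/interpreter is restarted.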
Upvotes: 1
Reputation: 66881
Looks like you do not have the hadoop-cloud module added. The class is not part of core Spark: https://search.maven.org/artifact/org.apache.spark/spark-hadoop-cloud_2.12/3.3.1/jar
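One way to sanity-check that the module is actually on the driver's classpath after adding it (this relies on the private _jvm handle of an active session named spark, so treat it as a debugging aid only):

# Raises a Py4JJavaError wrapping ClassNotFoundException if the jar is still missing.
spark.sparkContext._jvm.java.lang.Class.forName(
    'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')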
Upvotes: 1