Reputation: 1645
Below are the runtime versions in PyCharm:
Java Home /Library/Java/JavaVirtualMachines/jdk-11.0.16.1.jdk/Contents/Home
Java Version 11.0.16.1 (Oracle Corporation)
Scala Version 2.12.15
Spark Version spark-3.3.1
Python 3.9
I am trying to write a PySpark DataFrame to CSV as below:
df.write.csv("/Users/data/data.csv")
and get the error:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/readwriter.py", line 1240, in csv
self._jwrite.csv(path)
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o747.csv.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
And the Spark conf is as below:
from pyspark import SparkConf
import os

spark_conf = SparkConf()
spark_conf.setAll(parameters.items())
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
spark_conf.set('spark.hadoop.fs.s3.aws.credentials.provider',
               'org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider')
spark_conf.set('spark.hadoop.fs.s3.access.key', os.environ.get('AWS_ACCESS_KEY_ID'))
spark_conf.set('spark.hadoop.fs.s3.secret.key', os.environ.get('AWS_SECRET_ACCESS_KEY'))
spark_conf.set('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true')
spark_conf.set("com.amazonaws.services.s3.enableV4", "true")
spark_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_conf.set("fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark_conf.set("hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
spark_conf.set("fs.s3a.path.style.access", "true")
spark_conf.set("fs.s3a.multipart.size", "128M")
spark_conf.set("fs.s3a.fast.upload.active.blocks", "4")
spark_conf.set("fs.s3a.committer.name", "partitioned")
spark_conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
spark_conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
spark_conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
spark_conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
Any help to fix this issue is appreciated. Thanks!!
Upvotes: 3
Views: 3153
Reputation: 599
You have to add the required jar to the Spark job's packages. Downloading it locally won't change anything; the jar has to be on the job's classpath. To do that, change your code from:
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
to
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')
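For completeness, here is a minimal sketch of how that change might look when building the session. Only the spark.jars.packages line is taken from the question; the extra committer setting, the sample DataFrame and the output path are illustrative:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
# The key change: spark-hadoop-cloud provides PathOutputCommitProtocol.
spark_conf.set('spark.jars.packages',
               'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')
spark_conf.set('spark.sql.sources.commitProtocolClass',
               'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')

# Packages are resolved (from Maven Central by default) when the context starts,
# so spark.jars.packages must be set before the SparkSession is created.
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
spark.range(10).write.csv('/tmp/out.csv')  # illustrative write

Note that if a SparkSession is already running (e.g. in the PyCharm console), the new packages setting won't be picked up until the session/interpreter is restarted.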
Upvotes: 1
Reputation: 66881
Looks like you do not have the hadoop-cloud module added. The class is not part of core Spark: https://search.maven.org/artifact/org.apache.spark/spark-hadoop-cloud_2.12/3.3.1/jar
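One way to sanity-check that the module is actually on the driver's classpath after adding it (this relies on the private _jvm handle of an active session named spark, so treat it as a debugging aid only):

# Raises a Py4JJavaError wrapping ClassNotFoundException if the jar is still missing.
spark.sparkContext._jvm.java.lang.Class.forName(
    'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')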
Upvotes: 1