Reputation: 261
I want to read some parquet files located in a folder poc/folderName
on an S3 bucket myBucketName
into a PySpark DataFrame. I am using PySpark v2.4.3.
Below is the code I am using:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')

sqlContext = SQLContext(sc)
parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")
I have downloaded the hadoop-aws package by launching PySpark with pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.
An error occurred while calling o825.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
What am I doing wrong here? I am running the Python code using Anaconda and Spyder on Windows 10.
Upvotes: 1
Views: 1711
Reputation: 7424
The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0
Note that the hadoop-aws version must match the Hadoop version your Spark installation was built against (each hadoop-aws release is also tied to a specific AWS SDK for Java version). Spark 2.4.3 is built against Hadoop 2.7.x, so a 2.7.x release of hadoop-aws is needed rather than 3.3.0.
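If the application is launched from an IDE such as Spyder rather than via spark-submit, the same dependency can be declared through the spark.jars.packages configuration before the session is created. A minimal sketch, assuming the 2.7.0 coordinate above (adjust it to match the Hadoop build bundled with your Spark):

from pyspark.sql import SparkSession

# Declare hadoop-aws (and its transitive AWS SDK dependency) before the JVM starts;
# the version should match the Hadoop build bundled with your Spark install.
spark = SparkSession.builder \
    .appName("read-s3-parquet") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()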
In the Spark application's code, something like the following may also be needed:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
Note that when using the open source Hadoop driver, the S3 URI scheme is s3a and not s3 (as it is when using Spark on EMR and Amazon's proprietary EMRFS). e.g. s3a://bucket-name/
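Putting it together, a minimal sketch of the read itself, assuming the bucket and prefix from the question (myBucketName/poc/folderName) and that the credentials have been set as shown above:

# Read every parquet file under the prefix into a single DataFrame
parquet_df = spark.read.parquet("s3a://myBucketName/poc/folderName")
parquet_df.printSchema()
parquet_df.show(5)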
Credits to danielchalef
Upvotes: 1