user0204

Reputation: 261

Read a folder of parquet files from s3 location using pyspark to pyspark dataframe

I want to read the parquet files present in the folder poc/folderName of the S3 bucket myBucketName into a PySpark dataframe. I am using PySpark v2.4.3 for this.

Below is the code I am using:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
    sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')
    sqlContext = SQLContext(sc)
    parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")

I have downloaded the hadoop-aws package using the command pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.

    An error occurred while calling o825.parquet.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)

What am I doing wrong here? I am running the Python code using Anaconda and Spyder on Windows 10.

Upvotes: 1

Views: 1711

Answers (1)

Horatiu Jeflea

Reputation: 7424

The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:

    spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0

Note the above package version is tied to the installed AWS SDK for Java version.
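If you are unsure which version to pick, one quick check (a sketch that assumes py4j access to Spark's JVM internals through sc._jvm, which is not part of the original answer) is to print the Hadoop version bundled with your Spark installation and choose the hadoop-aws release with the same version number:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # org.apache.hadoop.util.VersionInfo reports the Hadoop version Spark was built with
    print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())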

In the Spark application's code, something like the following may also be needed:

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", access_id)
    hadoop_conf.set("fs.s3a.secret.key", access_key)

Note that when using the open-source Hadoop driver, the S3 URI scheme is s3a and not s3 (as it is when using Spark on EMR with Amazon's proprietary EMRFS), e.g. s3a://bucket-name/.
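Putting the pieces together for the setup described in the question (launched from Spyder rather than via spark-submit), a minimal sketch might look like the following. The hadoop-aws version 2.7.3, the placeholder credentials, and setting spark.jars.packages through the builder (which only takes effect if no SparkContext is already running) are assumptions, not part of the original answer:

    from pyspark.sql import SparkSession

    # hadoop-aws 2.7.3 is an assumed version matching the Hadoop build that
    # commonly ships with Spark 2.4.3; adjust it to whatever VersionInfo reports.
    # spark.jars.packages only takes effect if set before the SparkContext starts.
    spark = (SparkSession.builder
             .appName("read-s3a-parquet")
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
             .getOrCreate())

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", "id")    # placeholder credentials
    hadoop_conf.set("fs.s3a.secret.key", "sid")

    # Note the s3a:// scheme; pointing at the folder reads every parquet file in it.
    parquetDF = spark.read.parquet("s3a://myBucketName/poc/folderName")
    parquetDF.show()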

Credits to danielchalef

Upvotes: 1
