user0204

Reputation: 261

Read a folder of parquet files from s3 location using pyspark to pyspark dataframe

I want to read the parquet files present in the folder poc/folderName of the S3 bucket myBucketName into a PySpark dataframe. I am using PySpark v2.4.3 for this.

Below is the code I am using:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
    sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')
    sqlContext = SQLContext(sc)
    parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")

I have downloaded the hadoop-aws package using the command pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.

    An error occurred while calling o825.parquet.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)

What am I doing wrong here? I am running the Python code using Anaconda and Spyder on Windows 10.

Upvotes: 1

Views: 1711

Answers (1)

Horatiu Jeflea

Reputation: 7424

The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:

    spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0

Note the above package version is tied to the installed AWS SDK for Java version.
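If you are unsure which version to pick, one quick check (a sketch that assumes py4j access to Spark's JVM internals through sc._jvm, which is not part of the original answer) is to print the Hadoop version bundled with your Spark installation and choose the hadoop-aws release with the same version number:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # org.apache.hadoop.util.VersionInfo reports the Hadoop version Spark was built with
    print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())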

In the Spark application's code, something like the following may also be needed:

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", access_id)
    hadoop_conf.set("fs.s3a.secret.key", access_key)

Note that when using the open-source Hadoop driver, the S3 URI scheme is s3a and not s3 (as it is when using Spark on EMR with Amazon's proprietary EMRFS), e.g. s3a://bucket-name/.
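Putting the pieces together for the setup described in the question (launched from Spyder rather than via spark-submit), a minimal sketch might look like the following. The hadoop-aws version 2.7.3, the placeholder credentials, and setting spark.jars.packages through the builder (which only takes effect if no SparkContext is already running) are assumptions, not part of the original answer:

    from pyspark.sql import SparkSession

    # hadoop-aws 2.7.3 is an assumed version matching the Hadoop build that
    # commonly ships with Spark 2.4.3; adjust it to whatever VersionInfo reports.
    # spark.jars.packages only takes effect if set before the SparkContext starts.
    spark = (SparkSession.builder
             .appName("read-s3a-parquet")
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
             .getOrCreate())

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.access.key", "id")    # placeholder credentials
    hadoop_conf.set("fs.s3a.secret.key", "sid")

    # Note the s3a:// scheme; pointing at the folder reads every parquet file in it.
    parquetDF = spark.read.parquet("s3a://myBucketName/poc/folderName")
    parquetDF.show()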

Credits to danielchalef

Upvotes: 1
