Rafiul Sabbir

Reputation: 636

Pyspark read all JSON files from a subdirectory of S3 bucket

I am trying to read JSON files from a subdirectory called world in an S3 bucket named hello. When I list all the objects of that directory using boto3, I can see several part files (which were probably created by a Spark job), like below.

world/
world/_SUCCESS
world/part-r-00000-....json
world/part-r-00001-....json
world/part-r-00002-....json
world/part-r-00003-....json
world/part-r-00004-....json
world/part-r-00005-....json
world/part-r-00006-....json
world/part-r-00007-....json

I have written the following code to read all these files.

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder
    .config(conf=SparkConf().setAll(spark_config).setAppName(app_name))
    .getOrCreate()
)
hadoop_conf = spark_session._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "my-aws-access-key")
hadoop_conf.set("fs.s3a.secret.key", "my-aws-secret-key")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")

df = spark_session.read.json("s3a://hello/world/")

and getting the following error

py4j.protocol.Py4JJavaError: An error occurred while calling o98.json.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: , AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: 
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:557)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:392)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)

I have tried with "s3a://hello/world/*" and "s3a://hello/world/*.json" as well, but I still get the same error.

FYI, I am using the following versions of the tools:

pyspark 2.4.5
com.amazonaws:aws-java-sdk:1.7.4
org.apache.hadoop:hadoop-aws:2.7.1
org.apache.hadoop:hadoop-common:2.7.1

Can anyone help me with this?

Upvotes: 2

Views: 960

Answers (1)

Aditya Vikram Singh

Reputation: 476

It seems that the credentials you are using to access the bucket/folder don't have the required access rights.

Please check the following things:

  1. Credentials or role specified in your application code (a quick way to sanity-check these is sketched after this list)
  2. Policy attached to the Amazon Elastic Compute Cloud (Amazon EC2) instance profile role
  3. Amazon S3 VPC endpoint policy
  4. Amazon S3 source and destination bucket policies

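For item 1, a quick sanity check is to confirm, outside Spark, which identity the static keys resolve to and whether that identity can perform the HeadObject call that S3A's getFileStatus issues (the call that returned 403 in the stack trace above). Below is a minimal sketch with boto3, reusing the bucket, prefix, and placeholder keys from the question; substitute your real values.

# A sanity check, not a fix: verify that the same static keys Spark
# uses can reach the objects Spark fails on. The bucket ("hello"), the
# prefix ("world/"), and the placeholder keys all come from the question.
import boto3
from botocore.exceptions import ClientError

session = boto3.Session(
    aws_access_key_id="my-aws-access-key",      # same placeholder as the Spark config
    aws_secret_access_key="my-aws-secret-key",
)

# Which IAM identity do these keys actually belong to?
print(session.client("sts").get_caller_identity()["Arn"])

s3 = session.client("s3")
try:
    # S3A's getFileStatus issues a HeadObject, which is the request
    # that came back 403 in the stack trace
    s3.head_object(Bucket="hello", Key="world/_SUCCESS")
    s3.list_objects_v2(Bucket="hello", Prefix="world/", MaxKeys=5)
    print("HeadObject and ListObjectsV2 both succeeded")
except ClientError as e:
    print("Access check failed:", e.response["Error"])

If get_caller_identity already fails, the keys themselves are invalid; if only head_object fails with 403, the keys are valid but the attached policy is missing permissions.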
A quick way to debug: on the master node of the cluster, try to access the bucket with the AWS CLI (note that on an EC2 node this typically tests the instance profile role rather than the static keys in your code, unless you export the same keys into the environment first):

aws s3 ls s3://hello/world/

If this throws an error, work through the access-control checks in this guide: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-403-access-denied/
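For reference, the stack trace in the question fails inside getObjectMetadata, i.e. a HeadObject request, which needs s3:GetObject on the keys (plus s3:ListBucket on the bucket, so that Spark can enumerate the part files and denied lookups surface as 404 rather than 403). A minimal sketch of attaching such a least-privilege policy with boto3 follows; the IAM user name "spark-reader" and the policy name are hypothetical, and granting the same statements through the console or your IaC tool works just as well.

# A sketch of a least-privilege policy for the read path in the question.
# "spark-reader" is a hypothetical IAM user name; substitute your own.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # HeadObject/GetObject on everything under the prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::hello/world/*",
        },
        {   # listing, which Spark needs to enumerate the part files
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::hello",
        },
    ],
}

boto3.client("iam").put_user_policy(
    UserName="spark-reader",                # hypothetical
    PolicyName="allow-read-hello-world",
    PolicyDocument=json.dumps(policy),
)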

Upvotes: 2
