Reputation: 636
I am trying to read JSON files from a subdirectory called world
from a S3 bucket named hello
. When I list all the objects of that directory using boto3, I can see several part files(which were possibly created by a spark job) like below.
world/
world/_SUCCESS
world/part-r-00000-....json
world/part-r-00001-....json
world/part-r-00002-....json
world/part-r-00003-....json
world/part-r-00004-....json
world/part-r-00005-....json
world/part-r-00006-....json
world/part-r-00007-....json
I have written the following code to read all these files.
spark_session = SparkSession
.builder
.config(
conf=SparkConf().setAll(spark_config).setAppName(app_name)
).getOrCreate()
hadoop_conf = spark_session._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "my-aws-access-key")
hadoop_conf.set("fs.s3a.secret.key", "my-aws-secret-key")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
df = spark_session.read.json("s3a://hello/world/")
and getting the following error
py4j.protocol.Py4JJavaError: An error occurred while calling o98.json.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: , AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID:
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:557)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:392)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
I have tried with "s3a://hello/world/*"
and "s3a://hello/world/*.json"
as well but still getting the same error.
FYI, I am using the following versions of the tools:
pyspark 2.4.5
com.amazonaws:aws-java-sdk:1.7.4
org.apache.hadoop:hadoop-aws:2.7.1
org.apache.hadoop:hadoop-common:2.7.1
Can anyone help me with this?
Upvotes: 2
Views: 960
Reputation: 476
it seems that the credentials you are using to access the bucket/ folder doesn't have required access right .
Please check the following things
Few things which you can use to debug quickly on your master node of the cluster try to access the bucket using
aws s3 ls s3://hello/world/
if this throws the error try to resolve the access control by following this link https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-403-access-denied/
Upvotes: 2