Reputation: 133
I am trying to read multiple XML files from an Azure blob container using PySpark. When I run the script in an Azure Synapse notebook, I get the error below.
Note:
PySpark code:
The error is thrown at the line below:
df = spark.read.format("xml").options(rowTag="x", inferSchema=True).load(xmlfile.path)
Assumption:
My assumption is that I don't have read permission on the XML files, but I am not sure if I am missing something. Can you please shed some light on this?
Py4JJavaError: An error occurred while calling o2333.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 3.0 failed 4 times, most recent failure: Lost task 26.3 in stage 3.0 (TID 288) : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://<*path/x.xml*>
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1185)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:200)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:187)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:912)
at com.databricks.spark.xml.XmlRecordReader.initialize(XmlInputFormat.scala:86)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:240)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:237)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:192)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:91)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, <path/x.xml>
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:570)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:627)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:196)
... 23 more
Upvotes: 0
Views: 732
Reputation: 10490
Operation failed: "This request is not authorized to perform this operation using this permission.", 403
This error occurs when you don't have the proper role assigned on the storage account. Please make sure you have the "Storage Blob Data Contributor" role.
I tried the same process in my environment without assigning the role to myself, and I got a similar error:
After I added the "Storage Blob Data Contributor" role assignment to my user account and reran the code, the file was read successfully.
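For reference, the role assignment can also be done from the Azure CLI instead of the portal. A minimal sketch, where the storage account name, resource group, subscription ID, and user principal are placeholders you must replace with your own values:

```shell
# Assign the "Storage Blob Data Contributor" role to a user at the
# storage-account scope. All angle-bracket values below are placeholders.
az role assignment create \
  --assignee "user@example.com" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```

Note that role assignments can take a few minutes to propagate, so the Spark read may not succeed immediately after the assignment is created.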
Output:
Reference:
Error: This request is not authorized to perform this operation using this permission.", 403 in Azure synapse notebook while running from pyspark - Microsoft Q&A, by Pradeepcheekatala-MSFT
Upvotes: 0