Thiyagu

Reputation: 133

Unable to read XML files from Azure Blob container using Pyspark

I am trying to read multiple XML files from an Azure Blob container using PySpark. When I run the script in an Azure Synapse notebook, I get the error below.

Note:

  1. I have tested the connection using the Azure Data Lake Storage Gen2 linked service (both to the linked service and to the file path)
  2. I have added my workspace under 'Role Assignments' and given it the 'Storage Blob Data Contributor' role

Pyspark code:

The error is thrown at the line below:

df = spark.read.format("xml").options(rowTag="x", inferSchema=True).load(xmlfile.path)
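For context, spark-xml resolves the path through the ABFS driver, so the fully qualified URI matters. A minimal sketch of how such a URI is built; the container, account, and file path below are placeholder values, not the asker's real ones:

```python
# Hypothetical helper that builds the abfss:// URI the ABFS driver expects.
# "data", "mystorageacct", and "/xml/x.xml" are placeholders for illustration.
def abfss_uri(container: str, account: str, path: str) -> str:
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

uri = abfss_uri("data", "mystorageacct", "/xml/x.xml")
# The read itself would then look like:
# df = spark.read.format("xml").options(rowTag="x", inferSchema=True).load(uri)
print(uri)  # abfss://data@mystorageacct.dfs.core.windows.net/xml/x.xml
```

A 403 on the HEAD request for this URI points at the identity running the notebook, not at the URI itself.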

Assumption:

My assumption is that I don't have read permission on the XML files, but I am not sure whether I am missing anything. Can you please shed some light on this?

Py4JJavaError: An error occurred while calling o2333.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 3.0 failed 4 times, most recent failure: Lost task 26.3 in stage 3.0 (TID 288) : java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://<*path/x.xml*>
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1185)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:200)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:187)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:912)
    at com.databricks.spark.xml.XmlRecordReader.initialize(XmlInputFormat.scala:86)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:240)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:237)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:192)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:91)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:57)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, <path/x.xml>
    at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
    at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:570)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:627)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:196)
    ... 23 more

Upvotes: 0

Views: 732

Answers (1)

Venkatesan

Reputation: 10490

Operation failed: "This request is not authorized to perform this operation using this permission.", 403

This error occurs when the identity accessing the storage account does not have the proper role assigned. Please make sure you have the "Storage Blob Data Contributor" role.

I tried the same process in my environment without the role assigned to my account and got a similar error.

When I added the "Storage Blob Data Contributor" role assignment to my user account and re-ran the code, the file was read successfully.
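If you prefer the command line, the role assignment can be sketched with the Azure CLI. Every value in angle brackets below is a placeholder, not a value from the question:

```shell
# Sketch only: grant the Synapse workspace's managed identity (or your
# user account) the role at the storage-account scope. Replace every
# <placeholder> with your own values before running.
az role assignment create \
  --assignee "<workspace-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
```

Note that a new role assignment can take a few minutes to propagate before the 403 goes away.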

Reference:

Error: This request is not authorized to perform this operation using this permission.", 403 in Azure Synapse notebook while running from PySpark - Microsoft Q&A (answer by Pradeepcheekatala-MSFT)

Upvotes: 0
