AJR

Reputation: 589

Unable to load xml files using spark-xml

Can someone please help me understand why I'm not able to load my XML file from S3 using spark-xml?

I downloaded the spark-xml jar file and added its path under "Dependent JARs path" in the job details in AWS Glue. What I added is "s3://aws-glue-bucket/jar_files/spark-xml_2.13-0.16.0.jar", and yes, the jar does exist at that location.
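For reference, I believe the console's "Dependent JARs path" field corresponds to the `--extra-jars` special parameter in the job's DefaultArguments, so the job definition should contain something like this (a sketch; the surrounding job fields are trimmed):

```json
{
  "DefaultArguments": {
    "--extra-jars": "s3://aws-glue-bucket/jar_files/spark-xml_2.13-0.16.0.jar"
  }
}
```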

Code below:

import sys
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

conf = SparkConf()
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
print("Spark Session Created ... good")

df = spark.read \
    .format("xml") \
    .option("rowTag", "person") \
    .load("s3://aws-glue-bucket/xml_files/persons.xml")

df.printSchema()
df.show(3)
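Before blaming the jar, I checked locally that the rowTag I pass to spark-xml actually matches the repeating element in my file. A minimal sketch using only the standard library (the sample XML below is hypothetical; my real file lives on S3):

```python
# Sanity check (run locally, no Spark needed): the rowTag option names the
# repeating element that becomes one DataFrame row, so <person> must actually
# repeat under the root. Sample data is made up for illustration.
import xml.etree.ElementTree as ET

sample = """<people>
  <person><name>Ann</name><age>34</age></person>
  <person><name>Bob</name><age>29</age></person>
</people>"""

root = ET.fromstring(sample)
persons = root.findall("person")
print(len(persons))                       # how many <person> rows spark-xml would see
print(persons[0].find("name").text)       # first row's name field
```

The parse succeeds and finds the repeating `<person>` elements, so the file structure and rowTag look consistent to me.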

The error is: An error occurred while calling o87.load. Full traceback:

2023-02-21 02:27:50,881 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/XmlTestingJob.py", line 25, in <module>
    .load("s3://aws-glue-bucket/xml_files/persons.xml")
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o87.load.
: java.lang.NoClassDefFoundError: scala/Product$class
    at com.databricks.spark.xml.util.PermissiveMode$.<init>(ParseMode.scala:33)
    at com.databricks.spark.xml.util.PermissiveMode$.<clinit>(ParseMode.scala)
    at com.databricks.spark.xml.XmlOptions$$anonfun$26.apply(XmlOptions.scala:57)
    at com.databricks.spark.xml.XmlOptions$$anonfun$26.apply(XmlOptions.scala:57)
    at scala.collection.MapLike.getOrElse(MapLike.scala:127)
    at scala.collection.MapLike.getOrElse$(MapLike.scala:125)
    at org.apache.spark.sql.catalyst.util.BaseCaseInsensitiveMap.getOrElse(CaseInsensitiveMap.scala:69)
    at com.databricks.spark.xml.XmlOptions.<init>(XmlOptions.scala:57)
    at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:76)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
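From what I've read, a `NoClassDefFoundError` on a `scala/*` class usually points at a Scala binary-version mismatch between a jar and the Spark runtime it runs on. The `_2.13` in the artifact name is the Scala binary version the jar was built for, so I wrote a small snippet to double-check which version I'm actually shipping (`parse_scala_version` is my own helper, not part of any library):

```python
# Extract the Scala binary version from an sbt-style artifact name like
# "spark-xml_2.13-0.16.0.jar". The "_X.Y-" suffix before the artifact
# version encodes the Scala version the jar was compiled against.
import re

def parse_scala_version(jar_name: str) -> str:
    match = re.search(r"_(\d+\.\d+)-", jar_name)
    if match is None:
        raise ValueError(f"no Scala suffix found in {jar_name!r}")
    return match.group(1)

print(parse_scala_version("spark-xml_2.13-0.16.0.jar"))  # 2.13
```

So my jar targets Scala 2.13; I'm not sure whether the Glue runtime's Spark uses the same Scala version, and whether that matters here.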

Thank you in advance.

Upvotes: 1

Views: 1020

Answers (0)
