Harry Leboeuf

Reputation: 745

How can I read an XML file in Azure Databricks Spark?

I was looking for some info on the MSDN forums but couldn't find a good forum. While reading on the Spark site I got the hint that I would have better chances here. Bottom line: I want to read from a Blob storage that receives a continuous feed of XML files, all of them small, and finally store these files in an Azure DW. Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the library xml.etree.ElementTree, but I can't get it imported. Any help pushing me in a good direction is appreciated.

Upvotes: 4

Views: 13634

Answers (3)

Raman gupta

Reputation: 1

I found one solution for reading an XML file in Databricks:

Install this library: com.databricks:spark-xml_2.12:0.11.0 on a cluster with this configuration: 10.5 (includes Apache Spark 3.2.1, Scala 2.12).

Using the command %fs head "" you can inspect the file to find the rootTag and rowTag.

df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")

display(df)
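
If the file is small enough to load whole, the two tag names can also be found programmatically. The sketch below is only an illustration: it uses xml.etree.ElementTree, which ships with the Python standard library, and both the SAMPLE document and the guess_tags helper are made up for this example (only the tag names orders and purchase_item come from the command above; the id child is an assumption). It treats the most frequent direct child of the root as the likely rowTag.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real file's contents are not shown in this answer.
SAMPLE = """<orders>
  <purchase_item><id>1</id></purchase_item>
  <purchase_item><id>2</id></purchase_item>
</orders>"""


def guess_tags(xml_text):
    """Return (root tag, most frequent direct child tag or None)."""
    root = ET.fromstring(xml_text)
    # Count the tags of the root's direct children only.
    child_counts = Counter(child.tag for child in root)
    row_tag = child_counts.most_common(1)[0][0] if child_counts else None
    return root.tag, row_tag


root_tag, row_tag = guess_tags(SAMPLE)
print(root_tag, row_tag)
```

Note that this needs a complete, well-formed document; a truncated preview of a larger file, such as the output of %fs head, would likely fail to parse.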

Upvotes: 0

soMuchToLearnAndShare

Reputation: 1035

I found this one really helpful: https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb

The author also has a YouTube video that walks through the steps.

In summary, there are 2 approaches:

  1. Install the library on your Databricks cluster from the 'library' tab.
  2. Install it by launching spark-shell in the notebook itself.

Upvotes: 1

jegordon

Reputation: 1287

One way is to use the Databricks spark-xml library:

  1. Import the spark-xml library into your workspace: https://docs.databricks.com/user-guide/libraries.html#create-a-library (search for spark-xml in the Maven/Spark package section and import it).
  2. Attach the library to your cluster: https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
  3. Use the following code in your notebook to read the XML file, where "note" is the root of my XML file.

xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
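
A possible variation, offered as a sketch rather than a drop-in replacement: as far as I know, the spark-xml README documents rowTag as the option that picks the element treated as a row when reading, while rootTag is listed as a write option. The helper name read_xml and its parameters are hypothetical; spark is the session that a Databricks notebook already provides, and the library still has to be attached to the cluster for the "xml" format to resolve.

```python
def read_xml(spark, path, row_tag):
    """Load the XML at `path` into a DataFrame, one row per `row_tag` element.

    `spark` is an existing SparkSession (assumed to have spark-xml available).
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)  # element name that becomes one row
        .load(path)
    )


# In a Databricks notebook, where `spark` is predefined:
# xmldata = read_xml(spark, "dbfs:/mnt/mydatafolder/xmls/note.xml", "note")
```

Passing the session in as an argument keeps the helper easy to reuse across notebooks and to exercise with a stand-in object.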


Upvotes: 4
