Reputation: 745
I was looking for information on the MSDN forums but couldn't find a suitable forum. While reading the Spark site I got the hint that I would have better chances here. Bottom line: I want to read from a Blob storage container that receives a continuous feed of small XML files; ultimately we store these files in an Azure DW. Using Azure Databricks I can use Spark and Python, but I can't find a way to read the XML type. Some sample scripts used the library xml.etree.ElementTree, but I can't get it imported. Any help pushing me in the right direction is appreciated.
Upvotes: 4
Views: 13634
Reputation: 1
I found one solution for reading an XML file in Databricks:
Install the library com.databricks:spark-xml_2.12:0.11.0 on a cluster running the 10.5 runtime (includes Apache Spark 3.2.1, Scala 2.12).
Use the command %fs head "" to inspect the file and find its rootTag and rowTag.
df = spark.read.format('xml').option("rootTag","orders").option("rowTag","purchase_item").load("dbfs:/databricks-datasets/retail-org/purchase_orders/purchase_orders.xml")
display(df)
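To make the role of rowTag more concrete, here is a minimal sketch in plain Python using the standard-library xml.etree.ElementTree module (the one mentioned in the question). It is not how spark-xml works internally; it only illustrates the idea that each element matching the row tag becomes one record. The sample document, its id/product fields, and the helper name rows_from_xml are assumptions made up for illustration; only the tag names orders and purchase_item come from the answer above.

```python
import xml.etree.ElementTree as ET  # part of the Python standard library

# Hypothetical sample document; the real purchase_orders.xml may look different.
SAMPLE_XML = """<orders>
  <purchase_item><id>1</id><product>widget</product></purchase_item>
  <purchase_item><id>2</id><product>gadget</product></purchase_item>
</orders>"""


def rows_from_xml(xml_text, row_tag):
    """Return one dict per <row_tag> element, mapping child tag -> child text."""
    root = ET.fromstring(xml_text)
    # iter() walks the root and all descendants, yielding elements named row_tag
    return [{child.tag: child.text for child in row} for row in root.iter(row_tag)]


rows = rows_from_xml(SAMPLE_XML, "purchase_item")
for row in rows:
    print(row)
```

All values come back as strings, since no schema is inferred here. For small files, a list of dicts like this could probably be handed to spark.createDataFrame(rows), but for a continuous feed of many files the spark-xml reader above is likely the better fit.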
Upvotes: 0
Reputation: 1035
I found this one really helpful: https://github.com/raveendratal/PysparkTelugu/blob/master/Read_Write_XML_File.ipynb
The author also has a YouTube video that walks through the steps.
In summary, 2 approaches:
Upvotes: 1
Reputation: 1287
One way is to use the Databricks spark-xml library:
xmldata = spark.read.format('xml').option("rootTag","note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
Example :
Upvotes: 4