Desanth pv

Reputation: 351

How to load all XML files from an HDFS directory using the Spark Databricks XML parser

How do I get a single DataFrame of all the XML files in an HDFS directory, all of which have the same XML schema, using the Databricks XML parser?

Upvotes: 2

Views: 5801

Answers (3)

Davos

Reputation: 5415

You can do this using globbing. See the Spark DataFrameReader load method. load can take a single path string, a sequence of paths, or no argument for data sources that don't have paths (i.e. not HDFS or S3 or other file systems). http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("inferSchema", "true")
  .option("rowTag", "address") // the node of your XML to be treated as a row
  .load("/path/to/files/*.xml")

load can also take a single string of comma-separated paths:

.load("/path/to/files/File1.xml, /path/to/files/File2.xml")

Or, similar to this answer: Reading multiple files from S3 in Spark by date period.

You can also use a sequence of paths:

val paths: Seq[String] = ...
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address") // the format and options still need to be set
  .load(paths: _*)
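For the date-period case from that linked answer, here is a hedged sketch of one way to build such a sequence; the one-directory-per-day path layout is an assumption:

import java.time.LocalDate

// Generate one hypothetical glob per day in a date range,
// then pass the whole sequence to load as above.
val start = LocalDate.of(2017, 1, 1)
val end   = LocalDate.of(2017, 1, 7)
val paths: Seq[String] = Iterator
  .iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(d => s"/path/to/files/$d/*.xml") // e.g. /path/to/files/2017-01-01/*.xml
  .toSeq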

Note that schema inference is pretty hectic for XML. I've not had much success when there are a lot of files involved; specifying the schema works better. If you can guarantee that your XML files all have the same schema, you could use a small sample of them to infer the schema and then load the rest with it. I don't think that's safe though, because XML can still be "valid" even if it is missing some nodes or elements with regard to an XSD.
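A rough sketch of that sampling approach, assuming a hypothetical /path/to/files/sample/ directory holding a few representative files:

import org.apache.spark.sql.types.StructType

// Infer the schema once from a small, representative sample
val sampleSchema: StructType = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load("/path/to/files/sample/*.xml")
  .schema

// Then apply it to the full load, skipping inference entirely
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .schema(sampleSchema)
  .load("/path/to/files/*.xml")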

Upvotes: 4

khushbu kanojia

Reputation: 250

Set up your Maven dependency on spark-xml:

https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10/0.2.0
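From the coordinates on that page, the corresponding pom.xml dependency block would be:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.2.0</version>
</dependency>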

Then use the code below in your Spark program to read the HDFS XML files and create a single DataFrame:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address") // the row tag of your XML files to treat as a row
  .load("file.xml")

val selectedResult = df.select("city", "zipcode")

selectedResult.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "address") // the root tag to wrap the output rows
  .option("rowTag", "address")
  .save("result.xml")

Find a complete example on GitHub:

https://github.com/databricks/spark-xml/blob/master/README.md

Upvotes: 0

BalaramRaju

Reputation: 439

If you want to read the XML data by loading each file separately and processing it individually, below is a skeleton of how that would look.

import scala.xml.XML

// Each element of rdd1 is a (filePath, fileContents) pair
val rdd1 = sc.wholeTextFiles("/data/tmp/test/*")

// Parse each file's contents into a scala.xml.Elem
val xml = rdd1.map(x => XML.loadString(x._2))
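From there you can pull values out of each parsed document with the standard Scala XML selectors; a small example, where the <city> element name is hypothetical:

// Extract the text of every <city> element in each document
val cities = xml.map(doc => (doc \\ "city").text)
cities.collect().foreach(println)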

Upvotes: 0
