Reputation: 351
How to get a single DataFrame from all XML files in an HDFS directory, when they share the same XML schema, using the Databricks XML parser
Upvotes: 2
Views: 5801
Reputation: 5415
You can do this using globbing. See the Spark DataFrameReader.load method: load can take a single path string, a sequence of paths, or no argument for data sources that don't have paths (i.e., not HDFS, S3, or another file system).
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("inferschema", "true")
  .option("rowTag", "address") // the XML element to treat as a row
  .load("/path/to/files/*.xml")
load can also take a single string of comma-separated paths:
.load("/path/to/files/File1.xml, /path/to/files/File2.xml")
Or, similar to this answer (Reading multiple files from S3 in Spark by date period), you can pass a sequence of paths:
val paths: Seq[String] = ...
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load(paths: _*)
Note that schema inference (inferschema) is pretty hectic for XML. I've not had a lot of success when a lot of files are involved; specifying a schema works better. If you can guarantee that your XML files all have the same schema, you could use a small sample of them to infer the schema and then load the rest with it, as sketched below. I don't think that's entirely safe though, because an XML file can still be "valid" even if it is missing some nodes or elements relative to an XSD.
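A minimal sketch of that sample-then-apply approach, assuming hypothetical file paths and the same "address" rowTag as above:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Infer the schema once from a single representative file (placeholder path).
val sampleSchema = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load("/path/to/files/sample.xml")
  .schema

// Reuse that schema for the whole directory, skipping the inference pass.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .schema(sampleSchema)
  .load("/path/to/files/*.xml")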
Upvotes: 4
Reputation: 250
Set up your Maven project with the Databricks spark-xml dependency:
https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10/0.2.0
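In your pom.xml that dependency would look roughly like this (coordinates taken from the link above):

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.2.0</version>
</dependency>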
Then use the code below in your Spark program to read the HDFS XML files and create a single DataFrame:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address") // the XML element of your files to treat as a row
  .load("file.xml")

val selectedResult = df.select("city", "zipcode")
selectedResult.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "address") // the root tag to wrap the output XML in
  .option("rowTag", "address")
  .save("result.xml")
Find the complete example on GitHub:
https://github.com/databricks/spark-xml/blob/master/README.md
Upvotes: 0
Reputation: 439
I see that you want to read the XML data by loading each XML file separately and processing it individually. Below is a skeleton of how that will look.
import scala.xml.XML

// Each record from wholeTextFiles is a (path, fileContent) pair.
val rdd1 = sc.wholeTextFiles("/data/tmp/test/*")
val xml = rdd1.map { case (path, content) => XML.loadString(content) }
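From there you can pull fields out of each parsed document with the standard scala.xml selectors. A small sketch, reusing the city and zipcode fields from the answer above as placeholder element names:

// Extract (city, zipcode) pairs from each parsed XML document.
val pairs = xml.map(doc => ((doc \\ "city").text, (doc \\ "zipcode").text))
pairs.collect().foreach(println)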
Upvotes: 0