Reputation: 84
This is a sample XML file with 2 row tags. Is there any approach to load this into a DataFrame with n number of row tags or elements using XPath in Spark Scala?
<book id="0">
<author>Matthew</author>
<publish_date>Sun Oct 01 00:00:00 EDT 2000</publish_date>
<description>An in-depth look at creating applications with XML.
</description>
<price id = "1">
<price>44.95</price>
<genre>Computer</genre>
<title>XML Developer's Guide</title>
</price>
</book>
Upvotes: 0
Views: 2491
Reputation: 928
You can create a schema for the above XML file as below.
import org.apache.spark.sql.types._

val innerSchema = StructType(
  StructField("price",
    ArrayType(
      StructType(
        // 44.95 is fractional, so DoubleType rather than LongType
        StructField("price", DoubleType, nullable = true) ::
        StructField("genre", StringType, nullable = true) ::
        StructField("title", StringType, nullable = true) :: Nil
      )
    ), nullable = true) :: Nil
)
val schema = StructType(
  StructField("author", StringType, nullable = true) ::
  StructField("publish_date", StringType, nullable = true) ::
  StructField("description", StringType, nullable = true) ::
  StructField("price", innerSchema, nullable = true) :: Nil
)
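A quick way to sanity-check the nested schema before reading any data is to print it as a tree (a small sketch; `schema` is the value defined above):

```scala
// Prints the schema as an indented tree, showing the nested
// price array inside the price struct
schema.printTreeString()
```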
Apply this schema to read the XML file:
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "book") // row tag matching the sample XML
  .schema(schema)
  .load(xmlFile)
  // Select the nested field and flatten it into top-level columns
  .select("author", "publish_date", "description", "price.*")
You can also let Spark infer the schema itself; you will get the same result:
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "book")
  //.schema(schema)
  .load(xmlFile)
  .select("author", "publish_date", "description", "price.*") // select required columns
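Because `price.*` still leaves `price` as an array of structs, you can additionally `explode` it to get one flat row per inner element (a sketch building on the `df` above; column names follow the schema):

```scala
import org.apache.spark.sql.functions.{col, explode}

// One output row per element of the nested price array
val flattened = df
  .withColumn("p", explode(col("price")))
  .select(col("author"), col("publish_date"), col("description"),
          col("p.price"), col("p.genre"), col("p.title"))
flattened.show()
```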
Save the data to a Hive table:
import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Overwrite).saveAsTable("dbName.tableName")
You can even save the DataFrame to a database, as below.
//create properties object
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "root")
prop.setProperty("password", "pw")
//jdbc mysql url - destination database is named "data"
val url = "jdbc:mysql://localhost:3306/data"
//destination database table
val table = "sample_data_table"
//write data from spark dataframe to database
df.write.mode("append").jdbc(url, table, prop)
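To verify the write, you can read the same table back over JDBC (reusing the `url`, `table`, and `prop` values above):

```scala
// Read the rows back from MySQL into a DataFrame and display them
val readBack = spark.read.jdbc(url, table, prop)
readBack.show()
```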
Update:
To save the DataFrame as CSV:
You can use Databricks spark-csv: https://github.com/databricks/spark-csv
df.write.format("com.databricks.spark.csv").save(filepath)
With Spark 2.x the spark-csv package is not needed, as CSV support is built into Spark.
df.write.format("csv").save(filepath)
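If you want column names in the output, the built-in CSV writer takes the usual options (a sketch; `filepath` as above):

```scala
df.write
  .format("csv")
  .option("header", "true") // write column names as the first row
  .mode("overwrite")
  .save(filepath)
```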
Please refer to https://github.com/databricks/spark-xml. Hope it helps!
Upvotes: 2