Allforone

Reputation: 84

How to load complex XML files containing more than one row tag into a DataFrame using Spark Scala, and save it as a table (looking for a generic solution)

This is a sample XML file with two row tags. Is there any approach to load a file like this, with n row tags or elements, into a DataFrame using XPath in Spark Scala?

<book id="0">
    <author>Matthew</author>
    <publish_date>Sun Oct 01 00:00:00 EDT 2000</publish_date>
    <description>An in-depth look at creating applications with XML. 
    </description>
 <price id = "1">   
    <price>44.95</price>
    <genre>Computer</genre>
    <title>XML Developer's Guide</title>
</price>
</book>

Upvotes: 0

Views: 2491

Answers (1)

KZapagol

Reputation: 928

  1. You can create a schema for the above XML file as below.

    import org.apache.spark.sql.types._

    val innerSchema = StructType(
      StructField("price",
        ArrayType(
          StructType(
            StructField("price", DoubleType, true) :: // 44.95 is fractional, so DoubleType rather than LongType
            StructField("genre", StringType, true) ::
            StructField("title", StringType, true) :: Nil
          )
        ), true) :: Nil
    )

    val schema = StructType(
      StructField("author", StringType, true) ::
      StructField("publish_date", StringType, true) ::
      StructField("description", StringType, true) ::
      StructField("price", innerSchema, true) :: Nil
    )
    
  2. Apply this schema to read the XML file.

     val df = spark.sqlContext.read.format("com.databricks.spark.xml")
              .option("rowTag", "book") // the row tag in the sample file is <book>
              .schema(schema)
              .load(xmlFile)
              // select the nested struct's fields; see the explode sketch below for flattening
              .select("author", "publish_date", "description", "price.*")
    

    You can also let Spark infer the schema itself; it will produce the same result.

    val df = spark.sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "book")
     //.schema(schema)
      .load(xmlFile)
      .select("author", "publish_date", "description", "price.*") // select the required columns
    
  3. Save the data to a Hive table.

       import org.apache.spark.sql.SaveMode

       df.write.mode(SaveMode.Overwrite).saveAsTable("dbName.tableName")
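
     Note that saveAsTable targets the Hive metastore only when the session was built with Hive support; otherwise Spark falls back to its default catalog. A minimal sketch of such a session (the app name is just a placeholder):

       import org.apache.spark.sql.SparkSession

       // enableHiveSupport wires the session to the Hive metastore
       val spark = SparkSession.builder()
         .appName("xml-to-hive")
         .enableHiveSupport()
         .getOrCreate()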
    
  4. You can even save the DataFrame to a database, as shown below.

      //create properties object
      val prop = new java.util.Properties
      prop.setProperty("driver", "com.mysql.jdbc.Driver")
      prop.setProperty("user", "root")
      prop.setProperty("password", "pw") 
    
      //jdbc mysql url - destination database is named "data"
      val url = "jdbc:mysql://localhost:3306/data"
    
      //destination database table 
      val table = "sample_data_table"
    
      //write data from spark dataframe to database
      df.write.mode("append").jdbc(url, table, prop)
    

Update:

To save the DataFrame as CSV:

  1. You can use the Databricks spark-csv package: https://github.com/databricks/spark-csv

    df.write.format("com.databricks.spark.csv").save(filepath)
    
  2. With Spark 2.x the spark-csv package is not needed, as CSV support is built into Spark.

    df.write.format("csv").save(filepath)
    

Please refer to https://github.com/databricks/spark-xml for more details. Hope it helps!

Upvotes: 2
