Allforone

Reputation: 84

How to load complex XML files containing more than one row tag into a DataFrame using Spark Scala, and save it as a table (looking for a generic solution)

This is a sample XML file with two row tags. Is there any approach to load a file like this, with n row tags or elements, into a DataFrame using XPath in Spark Scala?

<book id="0">
    <author>Matthew</author>
    <publish_date>Sun Oct 01 00:00:00 EDT 2000</publish_date>
    <description>An in-depth look at creating applications with XML. 
    </description>
 <price id = "1">   
    <price>44.95</price>
    <genre>Computer</genre>
    <title>XML Developer's Guide</title>
</price>
</book>

Upvotes: 0

Views: 2491

Answers (1)

KZapagol

Reputation: 928

  1. You can create a schema for the above XML file as below.

    import org.apache.spark.sql.types._

    val innerSchema = StructType(
      StructField("price",
        ArrayType(
          StructType(
            StructField("price", DoubleType, true) :: // 44.95 is fractional, so DoubleType rather than LongType
            StructField("genre", StringType, true) ::
            StructField("title", StringType, true) :: Nil
          )
        ), true) :: Nil
    )

    val schema = StructType(
      StructField("author", StringType, true) ::
      StructField("publish_date", StringType, true) ::
      StructField("description", StringType, true) ::
      StructField("price", innerSchema, true) :: Nil
    )
    
  2. Apply this schema to read the XML file.

     val df = spark.sqlContext.read.format("com.databricks.spark.xml")
              .option("rowTag", "book") // the row tag in the sample file is <book>
              .schema(schema)
              .load(xmlFile)
              // select the nested struct's fields; see the explode sketch below for flattening
              .select("author", "publish_date", "description", "price.*")
    

    You can also let Spark infer the schema itself; it will produce the same result.

    val df = spark.sqlContext.read.format("com.databricks.spark.xml")
      .option("rowTag", "book")
     //.schema(schema)
      .load(xmlFile)
      .select("author", "publish_date", "description", "price.*") // select the required columns
    
  3. Save the data to a Hive table.

       import org.apache.spark.sql.SaveMode

       df.write.mode(SaveMode.Overwrite).saveAsTable("dbName.tableName")
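
     Note that saveAsTable targets the Hive metastore only when the session was built with Hive support; otherwise Spark falls back to its default catalog. A minimal sketch of such a session (the app name is just a placeholder):

       import org.apache.spark.sql.SparkSession

       // enableHiveSupport wires the session to the Hive metastore
       val spark = SparkSession.builder()
         .appName("xml-to-hive")
         .enableHiveSupport()
         .getOrCreate()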
    
  4. You can even save the DataFrame to a database, as shown below.

      //create properties object
      val prop = new java.util.Properties
      prop.setProperty("driver", "com.mysql.jdbc.Driver")
      prop.setProperty("user", "root")
      prop.setProperty("password", "pw") 
    
      //jdbc mysql url - destination database is named "data"
      val url = "jdbc:mysql://localhost:3306/data"
    
      //destination database table 
      val table = "sample_data_table"
    
      //write data from spark dataframe to database
      df.write.mode("append").jdbc(url, table, prop)
    

Update:

To save the DataFrame as CSV:

  1. You can use the Databricks spark-csv package: https://github.com/databricks/spark-csv

    df.write.format("com.databricks.spark.csv").save(filepath)
    
  2. With Spark 2.x the spark-csv package is not needed, as CSV support is built into Spark.

    df.write.format("csv").save(filepath)
    

Please refer to https://github.com/databricks/spark-xml for more details. Hope it helps!

Upvotes: 2
