Lokesh

Reputation: 87

xml field in JSON data

I would like to know how to read and parse the xml field which is a part of JSON data.

root
 |-- fields: struct (nullable = true)
 |    |-- custid: string (nullable = true)
 |    |-- password: string (nullable = true)
 |    |-- role: string (nullable = true)
 |    |-- xml_data: string (nullable = true)

and that xml_data has lots of columns in it. Let's say the fields inside xml_data are like nested columns of the "fields" struct. So how do I parse all the columns "custid", "password", "role", "xml_data.refid", "xml_data.refname" into one data frame?

Long question short: how do I parse and read XML data that is inside a JSON file as string content?

Upvotes: 1

Views: 2469

Answers (1)

hagarwal

Reputation: 1163

This is a little tricky, but it can be achieved in a few simple steps:

  • Parse the XML string into a JSON string and wrap it with an identifier (below: '')
  • Convert the entire DataFrame to a Dataset of JSON strings
  • Map over the Dataset of strings, building valid JSON by locating the identifier appended in step 1
  • Convert the Dataset of valid JSON back to a DataFrame

That's it, done!

import spark.implicits._
import scala.xml.XML
import org.json4s.Xml.toJson
import org.json4s.jackson.JsonMethods.{compact, render}
import org.apache.spark.sql.functions.{col, udf}

val rdd = spark
  .sparkContext
  .parallelize(Seq("{\"fields\":{\"custid\":\"custid\",\"password\":\"password\",\"role\":\"role\",\"xml_data\":\"<person><refname>Test Person</refname><country>India</country></person>\"}}"))

val df = spark.read.json(rdd.toDS())

val xmlToJsonUDF = udf { xmlString: String =>
  val xml = XML.loadString(xmlString)
  // Wrap the converted JSON in '' markers so it can be located later
  s"''${compact(render(toJson(xml)))}''"
}

val xmlParsedDf = df.withColumn("xml_data", xmlToJsonUDF(col("fields.xml_data")))
val jsonDs = xmlParsedDf.toJSON

val validJsonDs = jsonDs.map(value => {
  // Locate the marker-wrapped XML-as-JSON payload
  val startIndex = value.indexOf("\"''")
  val endIndex = value.indexOf("''\"")
  // Strip the escaping added when the JSON was embedded as a string value
  val data = value.substring(startIndex, endIndex).replace("\\", "")
  // Splice the unescaped JSON back in and drop the '' identifier markers
  val validJson = s"${value.substring(0, startIndex)}$data${value.substring(endIndex)}"
    .replace("\"''", "")
    .replace("''\"", "")
  validJson
})

val finalDf = spark.read.json(validJsonDs)
finalDf.show(10)
finalDf.printSchema()

finalDf
  .select("fields.custid", "fields.password", "fields.role", "fields.xml_data", "xml_data.person.refname", "xml_data.person.country")
  .show(10)
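As a side note, if the XML schema is known up front, a simpler sketch is possible: have the UDF return a typed case class directly, so Spark infers a struct column and the second `spark.read.json` pass is not needed. The `Person` case class and its `refname`/`country` fields below are illustrative, taken from the sample XML; adjust them to your actual schema. Assumes the same `spark` session and `df` as above.

```scala
import scala.xml.XML
import org.apache.spark.sql.functions.{col, udf}

// Illustrative shape matching the sample <person> XML only
case class Person(refname: String, country: String)

val xmlToStructUDF = udf { xmlString: String =>
  val xml = XML.loadString(xmlString)
  // Pull out the known fields; Spark maps the case class to a struct column
  Person((xml \ "refname").text, (xml \ "country").text)
}

val typedDf = df
  .withColumn("xml_data", xmlToStructUDF(col("fields.xml_data")))
  .select(
    col("fields.custid"),
    col("fields.password"),
    col("fields.role"),
    col("xml_data.refname"),
    col("xml_data.country"))

typedDf.show(10)
```

The trade-off: this avoids the string-surgery round trip, but it only works when you can enumerate the XML fields ahead of time, whereas the JSON round trip above infers them automatically.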

Input & Output:

//Input
{"fields":{"custid":"custid","password":"password","role":"role","xml_data":"<person><refname>Test Person</refname><country>India</country></person>"}}

//Final Dataframe
+--------------------+--------------------+
|              fields|            xml_data|
+--------------------+--------------------+
|[custid, password...|[[India, Test Per...|
+--------------------+--------------------+

//Final Dataframe Schema
root
 |-- fields: struct (nullable = true)
 |    |-- custid: string (nullable = true)
 |    |-- password: string (nullable = true)
 |    |-- role: string (nullable = true)
 |    |-- xml_data: string (nullable = true)
 |-- xml_data: struct (nullable = true)
 |    |-- person: struct (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- refname: string (nullable = true)

Upvotes: 2
