Reputation: 399
I need to load the data to the spark dataframe which is having some XML and text content. Below is my data format.
1,2003,4349,<c><ab a="Roy" b="201"/><ab a="Joe" b="202"/></c>,54,M
I need to get the final output as below.
+--------+--------------+--------------------+-------------+--------------+--------------+-------------+-------+---------+
| Month|Year | pincode | name | id | manager_name|manager_id |dep_id |Gender |
+--------+--------------+--------------------+-------------+--------------+--------------+-------------+--------+--------+
|1 |2003 |4348 | Roy | 201 |Joe | 202 | 54 |M |
+--------+--------------+--------------------+-------------+--------------+--------------+-------------+-------+---------+
Upvotes: 1
Views: 553
Reputation: 5487
We can get the desired result using spark-xml library.
import com.databricks.spark.xml._
import com.databricks.spark.xml.functions.from_xml
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
val df = // Read csv file
// Assuming your xml content column name is xmldata
val xmlSchema = schema_of_xml(df.select("xmldata").as[String])
df.withColumn("xmldata", from_xml('xmldata, xmlSchema))
.select("*", "xmldata.ab")
.selectExpr(df.columns.diff(Array("xmldata")) ++
Array("ab[0]._a as name", "ab[0]._b as id", "ab[1]._a as manager_name", "ab[1]._b as manager_id"): _*)
.show(false)
/*
+-----+----+-------+------+------+----+---+------------+----------+
|Month|Year|pincode|dep_id|Gender|name|id |manager_name|manager_id|
+-----+----+-------+------+------+----+---+------------+----------+
|1 |2003|4349 |54 |M |Roy |201|Joe |202 |
+-----+----+-------+------+------+----+---+------------+----------+ /*
Upvotes: 1