Reputation: 2628
I am learning Spark. I have the XML below, from which I want to read two values and create two different columns:
<appRoot>
    <applist>
        <app type="test">
            <code>8.52544</code>
        </app>
        <app type="dev">
            <code>8.52537</code>
        </app>
    </applist>
</appRoot>
I want:
if type="test", the value (i.e. 8.52544) should go into a new column "app_test", and
if type="dev", the value (i.e. 8.52537) should go into a new column "app_dev".
I tried the following:
df.select(
  functions.when($"applist.app._type" === "test", $"applist.app.code").as("app_test"),
  functions.when($"applist.app._type" === "dev", $"applist.app.code").as("app_dev"))
but it returns
app_test with value [8.52544, 8.52537]
app_dev with value [8.52544, 8.52537]
How can I differentiate between the two types when setting the column values?
Update:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rootTag", "appRoot")
.option("rowTag", "applist")
.load("test.xml")
df.printSchema()
root
|-- app: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _type: string (nullable = true)
| | |-- code: double (nullable = true)
Upvotes: 1
Views: 529
Reputation: 14905
Using map_from_entries the array can be transformed into a map with _type as keys and code as values. Then the map can be used to create the two new columns.
import org.apache.spark.sql.functions.map_from_entries  // available in Spark 2.4+
import spark.implicits._

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "appRoot")
  .option("rowTag", "applist")
  .load(...)

// Build a per-row map of _type -> code, then look up each type.
df.withColumn("map", map_from_entries('app))
  .withColumn("app_test", 'map("test"))
  .withColumn("app_dev", 'map("dev"))
  .drop('map)
  .show(false)
prints (with some additional test data)
+--------------------------------------------+--------+-------+
|app |app_test|app_dev|
+--------------------------------------------+--------+-------+
|[[test, 8.52544], [dev, 8.52537], [x, 1.23]]|8.52544 |8.52537|
|[[test, 1.2], [dev, 3.4], [X, 9.9]] |1.2 |3.4 |
|[[dev, -5.6], [Z, -12.9]] |null |-5.6 |
+--------------------------------------------+--------+-------+
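map_from_entries only exists in Spark 2.4 and later. On older versions, a similar result can be obtained by exploding the array and pivoting _type back into columns. A minimal sketch under that assumption; row_id is a helper column introduced here purely to keep one output row per input row:

import org.apache.spark.sql.functions._
import spark.implicits._

// Tag each row, explode the struct array, then pivot _type into columns.
df.withColumn("row_id", monotonically_increasing_id())
  .select($"row_id", explode($"app").as("app"))
  .select($"row_id", $"app._type".as("type"), $"app.code".as("code"))
  .groupBy($"row_id")
  .pivot("type", Seq("test", "dev"))
  .agg(first($"code"))
  .withColumnRenamed("test", "app_test")
  .withColumnRenamed("dev", "app_dev")
  .drop("row_id")
  .show(false)

Passing the explicit value list Seq("test", "dev") to pivot means extra types (such as x or Z in the test data above) are simply ignored, matching the behaviour of the map lookup.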
Upvotes: 1
Reputation: 532
You can do this as follows. First, read the data into a DataFrame:
val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "appRoot")
  .option("rowTag", "applist")
  .load("/home/shrey/Downloads/mydoc.xml")
df.printSchema()
df.show()
Then, you can select the appropriate columns in a when condition by exploding the array:
df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .show(false)
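For the sample XML above, this yields one row per array element, with null in the non-matching column:

+--------+-------+
|app_test|app_dev|
+--------+-------+
|8.52544 |null   |
|null    |8.52537|
+--------+-------+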
If you want the output in a single row, you can aggregate it using the max() function (aggregate functions ignore nulls):
df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .agg(max($"app_test").as("app_test"), max($"app_dev").as("app_dev"))
  .show(false)
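For the sample XML above, this should print a single row:

+--------+-------+
|app_test|app_dev|
+--------+-------+
|8.52544 |8.52537|
+--------+-------+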
Upvotes: 0