happy

Reputation: 2628

How to access array-type values and set them in two different columns in Spark?

I am learning Spark. I have the XML below, from which I want to read two values and create two different columns:

<appRoot>
    <applist>
      <app type="test">
        <code>8.52544</code>
      </app>
      <app type="dev">
        <code>8.52537</code>
      </app>
    </applist>
</appRoot>

I want

if type="test" then it should set the value (i.e. 8.52544) in new column "app_test" and

if type="dev" then it should set the value (i.e. 8.52537) in new column "app_dev"

I tried the following:

df.select(
  functions.when($"applist.app._type" === "test", $"applist.app.code").as("app_test"),
  functions.when($"applist.app._type" === "dev", $"applist.app.code").as("app_dev"))

but it returns

app_test with value [8.52544, 8.52537]
app_dev with value [8.52544, 8.52537]

How can I differentiate between the types while setting the values in the columns?

Update:

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "appRoot")
  .option("rowTag", "applist")
  .load("test.xml")

df.printSchema()

root
 |-- app: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- code: double (nullable = true)

Upvotes: 1

Views: 529

Answers (2)

werner

Reputation: 14905

Using map_from_entries, the array can be transformed into a map with _type as the keys and code as the values. The map can then be used to create the two new columns.

import org.apache.spark.sql.functions.map_from_entries
import spark.implicits._

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "appRoot").option("rowTag", "applist")
  .load(...)

df.withColumn("map", map_from_entries('app)) // Map(_type -> code)
  .withColumn("app_test", 'map("test"))      // look up the "test" key
  .withColumn("app_dev", 'map("dev"))        // look up the "dev" key
  .drop('map)
  .show(false)

prints (with some additional test data)

+--------------------------------------------+--------+-------+
|app                                         |app_test|app_dev|
+--------------------------------------------+--------+-------+
|[[test, 8.52544], [dev, 8.52537], [x, 1.23]]|8.52544 |8.52537|
|[[test, 1.2], [dev, 3.4], [X, 9.9]]         |1.2     |3.4    |
|[[dev, -5.6], [Z, -12.9]]                   |null    |-5.6   |
+--------------------------------------------+--------+-------+
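
Note: map_from_entries is available in Spark 2.4 and later.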

Upvotes: 1

Shrey Jakhmola

Reputation: 532

You can do this as follows.

First, read the data into a DataFrame:

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "appRoot")
  .option("rowTag", "applist")
  .load("/home/shrey/Downloads/mydoc.xml")
df.printSchema()
df.show()

This prints the schema shown in the question and a single row where app = [[test, 8.52544], [dev, 8.52537]].

Then, explode the array and select the appropriate columns with when conditions:

df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .show(false)

For the sample XML this gives one row per app entry:

+--------+-------+
|app_test|app_dev|
+--------+-------+
|8.52544 |null   |
|null    |8.52537|
+--------+-------+

If you want the output in a single row, you can aggregate it using the max() function:

df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .agg(max($"app_test").as("app_test"), max($"app_dev").as("app_dev"))
  .show(false)
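
Since max() ignores nulls, this collapses the two exploded rows into a single row with app_test = 8.52544 and app_dev = 8.52537, matching the requested output.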


Upvotes: 0
