how to explode a spark dataframe

Question

I exploded a nested schema but I am not getting what I want,

before exploded it looks like this:

df.show()

+----------+----------------------------------------------------------+
|CaseNumber|                   SourceId                               |
+----------+----------------------------------------------------------+
|       0  |[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
+----------|----------------------------------------------------------|
|       1  |[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |                                             
+---------------------------------------------------------------------+

I want it to be like this

+----------+-------------------+
| CaseNumber| Sku | ContractId |
+----------+-------------------+
|       0  | 1    |      22    |
+----------|------|------------|   
|       1  | 3    |      24    | 
+------------------------------|

abiratsis · Accepted Answer

Here is one way using the build-in get_json_object function:

import org.apache.spark.sql.functions.get_json_object

val df = Seq(
  (0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
  (1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")

df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
  .withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
  .drop("SourceId")
  .show

// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// |         0|  1|        22|
// |         1|  3|        24|
// +----------+---+----------+

UPDATE

After our discussion we realised that the mentioned data is of array> type and not a simple string. Next is the solution for the new schema:

df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
  .withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))

how to explode a spark dataframe

Answers (1)

Related Questions