Curious Freak

Reputation: 21

Scala Spark DataFrame: sum the list of JSON values in a column

I have a Spark DataFrame as shown below:

id col1 col2
1 [{"a":1}] [{"d": 3, "e": 4}]
2 [{"a":2}] [{"d": 5, "e": 10}]

I want to obtain the following DataFrame, where col2_sum is the sum of all the JSON values in col2:

id col2_sum
1 7
2 15

Datatypes:

id:StringType

col1:StringType

col2:StringType

Thanks in advance

Upvotes: 0

Views: 398

Answers (1)

blackbishop

Reputation: 32710

Convert the JSON string into an array of maps using from_json, then use the aggregate function to sum the map values:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, """[{"a":1}]""", """[{"d": 3, "e": 4}]"""),
  (2, """[{"a":2}]""", """[{"d": 5, "e": 10}]""")
).toDF("id", "col1", "col2")

val df1 = (df
  // parse the JSON string into array<map<string,int>>
  .withColumn("col2", from_json(col("col2"), lit("array<map<string,int>>")))
  // keep only the map values: array<array<int>> flattened into array<int>
  .withColumn("col2", flatten(expr("transform(col2, x -> map_values(x))")))
  // sum the flattened array of values
  .withColumn("col2_sum", expr("aggregate(col2, 0, (acc, x) -> acc + x)"))
  .drop("col1", "col2")
)

df1.show
//+---+--------+
//| id|col2_sum|
//+---+--------+
//|  1|       7|
//|  2|      15|
//+---+--------+
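
For comparison, here is a rough alternative sketch (not part of the original answer): it parses col2 with an explicit ArrayType(MapType(StringType, IntegerType)) schema and sums via explode plus groupBy/sum instead of the higher-order aggregate function. It assumes Spark 2.4+ and the same df built above:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Alternative sketch: explicit schema + explode + groupBy/sum (assumes Spark 2.4+)
val schema = ArrayType(MapType(StringType, IntegerType))

val df2 = df
  .withColumn("m", explode(from_json(col("col2"), schema)))   // one row per parsed map
  .select(col("id"), explode(map_values(col("m"))).as("v"))   // one row per map value
  .groupBy("id")
  .agg(sum("v").as("col2_sum"))

df2.show()

This shuffles for the groupBy, so the aggregate-based version above is generally preferable; the explode form is mainly useful if you also need the individual values as rows.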

Upvotes: 1
