nirmal
nirmal

Reputation: 107

Iterate over a spark dataframe and create a nested json string

I'm having a dataframe with the below details and I need to iterate over each row to create the below formatted nested json string.

Employee name Salary sick_leave_day paid_leave_day
Karthi 20000 10-10-2021 Y
Karthi 20000 11-12-2021 Y
Karthi 20000 13-12-2021 N
Rajesh 25000 15-12-2021 Y
Rajesh 25000 17-11-2021 N
Rajesh 25000 10-10-2021 Y

The output json should look like below, Here, Name and salary is having unique data and sick_leave_day,paid_leave_day changes for each row.

{
    "Name": "Karthi",
    "Salary": "20000",
    leave_details: [{
            "sick_leave_day": "10-10-2021",
            "paid_leave_day ": "Y"
        },
        {
            "sick_leave_day": "11-12-2021",
            "paid_leave_day ": "Y"
        },
        {
            "sick_leave_day": "13-12-2021",
            "paid_leave_day ": "N"
        }
    ]
}

I have tried to iterate over dataframe using foreach method, but I'm unable to get the logic. If I group the name and salary, I can't get the value of name and salary only once. So, please help me to write the logic by iterating over each row to create this logic. Thanks in advance.

Upvotes: 0

Views: 1161

Answers (2)

blackbishop
blackbishop

Reputation: 32640

Group by Employee name and collect list of structs for leave_details array and first (or max) Salary then save a json:

val result = df.groupBy(col("Employee name").as("name")).agg(
    first($"Salary").as("Salary"),
    collect_list(struct($"sick_leave_day", $"paid_leave_day")).as("leave_details")
)

result.toJSON.collect.foreach(println(_))

//{"name":"Rajesh","Salary":"25000","leave_details":[{"sick_leave_day":"15-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"17-11-2021","paid_leave_day":"N"},{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"}]}
//{"name":"Karthi","Salary":"20000","leave_details":[{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"},{"sick_leave_day":"11-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"13-12-2021","paid_leave_day":"N"}]}

Upvotes: 0

Fabich
Fabich

Reputation: 3059

You can use the collect_list aggregation function to group by Employee and get the list of leave days.

For instance you can manually create a JSON with the 2 "leave_day" columns then aggregate using collect_list:

var result = df.withColumn("new_col", concat(lit("{ sick_leave_day:"), df.col("sick_leave_day"), lit(", paid_leave_day:"), df.col("paid_leave_day"), lit("}")))
    .groupBy("Employee name", "Salary")
    .agg(collect_list("new_col"))

result.show(truncate=false)

+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Employee name|Salary|collect_list(new_col)                                                                                                                           |
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Rajesh       |25000 |[{ sick_leave_day:15-12-2021, paid_leave_day:Y}, { sick_leave_day:17-11-2021, paid_leave_day:N}, { sick_leave_day:10-10-2021, paid_leave_day:Y}]|
|Karthi       |20000 |[{ sick_leave_day:10-10-2021, paid_leave_day:Y}, { sick_leave_day:11-12-2021, paid_leave_day:Y}, { sick_leave_day:13-12-2021, paid_leave_day:N}]|
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+

Upvotes: 0

Related Questions