Reputation: 107
I'm having a dataframe with the below details and I need to iterate over each row to create the below formatted nested json string.
Employee name | Salary | sick_leave_day | paid_leave_day |
---|---|---|---|
Karthi | 20000 | 10-10-2021 | Y |
Karthi | 20000 | 11-12-2021 | Y |
Karthi | 20000 | 13-12-2021 | N |
Rajesh | 25000 | 15-12-2021 | Y |
Rajesh | 25000 | 17-11-2021 | N |
Rajesh | 25000 | 10-10-2021 | Y |
The output json should look like below, Here, Name
and salary
is having unique data and sick_leave_day
,paid_leave_day
changes for each row.
{
"Name": "Karthi",
"Salary": "20000",
leave_details: [{
"sick_leave_day": "10-10-2021",
"paid_leave_day ": "Y"
},
{
"sick_leave_day": "11-12-2021",
"paid_leave_day ": "Y"
},
{
"sick_leave_day": "13-12-2021",
"paid_leave_day ": "N"
}
]
}
I have tried to iterate over dataframe using foreach
method, but I'm unable to get the logic. If I group the name and salary, I can't get the value of name and salary only once. So, please help me to write the logic by iterating over each row to create this logic. Thanks in advance.
Upvotes: 0
Views: 1161
Reputation: 32640
Group by Employee name
and collect list of structs for leave_details
array and first (or max) Salary
then save a json:
val result = df.groupBy(col("Employee name").as("name")).agg(
first($"Salary").as("Salary"),
collect_list(struct($"sick_leave_day", $"paid_leave_day")).as("leave_details")
)
result.toJSON.collect.foreach(println(_))
//{"name":"Rajesh","Salary":"25000","leave_details":[{"sick_leave_day":"15-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"17-11-2021","paid_leave_day":"N"},{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"}]}
//{"name":"Karthi","Salary":"20000","leave_details":[{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"},{"sick_leave_day":"11-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"13-12-2021","paid_leave_day":"N"}]}
Upvotes: 0
Reputation: 3059
You can use the collect_list aggregation function to group by Employee and get the list of leave days.
For instance you can manually create a JSON with the 2 "leave_day" columns then aggregate using collect_list:
var result = df.withColumn("new_col", concat(lit("{ sick_leave_day:"), df.col("sick_leave_day"), lit(", paid_leave_day:"), df.col("paid_leave_day"), lit("}")))
.groupBy("Employee name", "Salary")
.agg(collect_list("new_col"))
result.show(truncate=false)
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+ |Employee name|Salary|collect_list(new_col) | +-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+ |Rajesh |25000 |[{ sick_leave_day:15-12-2021, paid_leave_day:Y}, { sick_leave_day:17-11-2021, paid_leave_day:N}, { sick_leave_day:10-10-2021, paid_leave_day:Y}]| |Karthi |20000 |[{ sick_leave_day:10-10-2021, paid_leave_day:Y}, { sick_leave_day:11-12-2021, paid_leave_day:Y}, { sick_leave_day:13-12-2021, paid_leave_day:N}]| +-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
Upvotes: 0