Reputation: 93
Let's say I have a delta table in Azure databricks that stores the staff details (denormalized).
I wanted to export the data in the JSON format and save it as a single file on a storage location. I need help with the databricks sql query to group/construct the data in the JSON format.
Here is the sample code and desired output: Delta Table schema:
%sql
create table if not exists staff_details (
department_id int comment 'id of the department',
department_name string comment 'name of the department',
employee_id int comment 'employee id of the staff',
first_name string comment 'first name of the staff',
last_name string comment 'last name of the staff'
)
using delta
Script to populate the delta table:
%sql
insert into staff_details(department_id, department_name, employee_id, first_name, last_name)
values(1,'Dept-A',101,'Guru','Datt'), (1,'Dept-A',102,'Tom','Cruise'), (2,'Dept-B',201,'Angelina','Jolie')
Show records:
%sql
select * from staff_details order by department_id, employee_id
Desired output:
{
"staff_details":[
{
"department_id":1,
"department_name": "Dept-A",
"staff_members": [
{
"employee_id":101,
"first_name":"Guru",
"last_name":"Datt"
},
{
"employee_id":102,
"first_name":"Tom",
"last_name":"Cruise"
}
]
},
{
"department_id":2,
"department_name": "Dept-B",
"staff_members": [
{
"employee_id":201,
"first_name":"Angelina",
"last_name":"Jolie"
}
]
}
]
}
I tried using the to_json() function, and also using manual string concatenation with group by etc, but none of that is working well.
Upvotes: 1
Views: 1282
Reputation: 7156
The below spark sql query will convert the staff_details table in nested json format.
select collect_set(to_json(struct(*))) as staff_details
from (SELECT department_id, department_name,
collect_set(to_json(struct(employee_id, first_name,
last_name))) as staff_members FROM staff_details
group by department_id,department_name)a
%python
df=spark.sql("select collect_set(to_json(struct(*))) as json_data from (select collect_set(to_json(struct(*))) as staff_details from (SELECT department_id,department_name,collect_set(to_json(struct(employee_id,first_name,last_name))) as staff_members FROM staff_details group by department_id,department_name)a)b")
#df.show()
import pandas
pandas_converted=df.toPandas()
pandas_converted.to_json("/dbfs/tmp/staff_details3.json")
Upvotes: 2