Sujai Sparks
Sujai Sparks

Reputation: 93

How to convert records in Azure Databricks delta table to a nested JSON structure?

Let's say I have a delta table in Azure databricks that stores the staff details (denormalized).

sample data illustration

I wanted to export the data in the JSON format and save it as a single file on a storage location. I need help with the databricks sql query to group/construct the data in the JSON format.

Here is the sample code and desired output: Delta Table schema:

%sql
create table if not exists staff_details (
department_id int comment 'id of the department',
department_name string comment 'name of the department',
employee_id int comment 'employee id of the staff',
first_name string comment 'first name of the staff',
last_name string comment 'last name of the staff'
)
using delta

Script to populate the delta table:

%sql
insert into staff_details(department_id, department_name, employee_id, first_name, last_name)
values(1,'Dept-A',101,'Guru','Datt'), (1,'Dept-A',102,'Tom','Cruise'), (2,'Dept-B',201,'Angelina','Jolie')

Show records:

%sql
select * from staff_details order by department_id, employee_id

Desired output:

{
    "staff_details":[
        {
            "department_id":1,
            "department_name": "Dept-A",
            "staff_members": [
                {
                    "employee_id":101,
                    "first_name":"Guru",
                    "last_name":"Datt"
                },
                {
                    "employee_id":102,
                    "first_name":"Tom",
                    "last_name":"Cruise"
                }                
                
            ]
        },
        {
            "department_id":2,
            "department_name": "Dept-B",
            "staff_members": [
                {
                    "employee_id":201,
                    "first_name":"Angelina",
                    "last_name":"Jolie"
                }             
                
            ]
        }
    ]
}

I tried using the to_json() function, and also using manual string concatenation with group by etc, but none of that is working well.

Upvotes: 1

Views: 1282

Answers (1)

Aswin
Aswin

Reputation: 7156

The below spark sql query will convert the staff_details table in nested json format.

select collect_set(to_json(struct(*))) as staff_details 
from (SELECT department_id, department_name,
collect_set(to_json(struct(employee_id, first_name,
last_name))) as staff_members FROM staff_details
group by department_id,department_name)a

enter image description here

  • You can save the query result to data frame and then using pandas, save the data frame in a single file. Below is the code.
%python
df=spark.sql("select collect_set(to_json(struct(*))) as json_data from (select collect_set(to_json(struct(*))) as staff_details from (SELECT department_id,department_name,collect_set(to_json(struct(employee_id,first_name,last_name))) as staff_members FROM staff_details group by department_id,department_name)a)b")
#df.show()
import pandas
pandas_converted=df.toPandas()
pandas_converted.to_json("/dbfs/tmp/staff_details3.json")

Upvotes: 2

Related Questions