Reputation: 111
I have a custom Airbyte job that fails to normalize the data, so I need to do it manually. The following data is pulled from our HR system:
{
"title": "My Report",
"fields": [{
"id": "employeeNumber",
"name": "Employee #"
},
{
"id": "firstName"
"name": "First Name"
},
{
"id": "lastName"
"name": "Last Name"
}],
"employees": [{
"employeeNumber": "1234",
"firstName": "Ann",
"lastName": "Perkins"
},
{
"employeeNumber": "5678",
"firstName": "Bob",
"lastName": "Builder"
}]
}
My current bigquery table looks like this (the json is stored as a string):
_airbyte_ab_id | _airbyte_emitted_at | _airbyte_data |
---|---|---|
123abc | 2022-01-30 19:41:59 UTC | {"title": "My Datawareouse", "fields": [ {"id": "employeeNumber", "name": "Employee_Number"}, {"id": "firstName", "name": "First_Name" }, { "id": "lastName", "name": "Last_Name"} ], "employees": [ { "employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins" }, { "employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder" } ] } |
I am trying to normalize the table to look like this:
_airbyte_ab_id | _airbyte_emitted_at | Employee_Number | First_Name | Last_Name |
---|---|---|---|---|
123abc | 2022-01-30 19:41:59 UTC | 1234 | Ann | Perkins |
123abc | 2022-01-30 19:41:59 UTC | 5678 | Bob | Builder |
How to flatten the json into columns as the example above, using SQL in bigquery? (The script will be running from dbt, but for now, I am just trying to get a valid query to run)
I should add that the actual json has far more fields, they might change, and I expect null values for things like "Middle Name" and the like. So, in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array.
Upvotes: 2
Views: 2947
Reputation: 172944
... in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array
For case when your have fields defined dynamically and potentially even different from row to row - i recommend considering below flattening approach
select _airbyte_ab_id, _airbyte_emitted_at,
md5(employee) employee_hash,
json_value(field, "$.id") key,
regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
unnest(json_extract_array(_airbyte_data, '$.fields')) field
if applied to sample data in your question - output is
Upvotes: 1
Reputation: 172944
How to flatten the json into columns as the example above, using SQL in bigquery?
Consider below approach
select _airbyte_ab_id, _airbyte_emitted_at,
json_value(employee, '$.employeeNumber') employeeNumber,
json_value(employee, '$.firstName') firstName,
json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee
if applied to sample data in your question - output is
Upvotes: 1