Chr1sCon
Chr1sCon

Reputation: 111

Flatten json string in BigQuery

I have a custom Airbyte job that fails to normalize the data, so I need to do it manually. The following data is pulled from our HR system:


{
  "title": "My Report", 
  "fields": [{
      "id": "employeeNumber", 
      "name": "Employee #"
    }, 
    {
      "id": "firstName" 
      "name": "First Name"
    }, 
    { 
      "id": "lastName"
      "name": "Last Name"
    }], 
    "employees": [{ 
      "employeeNumber": "1234", 
      "firstName": "Ann", 
      "lastName": "Perkins" 
    }, 
    { 
      "employeeNumber": "5678", 
      "firstName": "Bob", 
      "lastName": "Builder" 
    }]
}

My current bigquery table looks like this (the json is stored as a string):

_airbyte_ab_id _airbyte_emitted_at _airbyte_data
123abc 2022-01-30 19:41:59 UTC {"title": "My Datawareouse", "fields": [ {"id": "employeeNumber", "name": "Employee_Number"}, {"id": "firstName", "name": "First_Name" }, { "id": "lastName", "name": "Last_Name"} ], "employees": [ { "employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins" }, { "employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder" } ] }

I am trying to normalize the table to look like this:

_airbyte_ab_id _airbyte_emitted_at Employee_Number First_Name Last_Name
123abc 2022-01-30 19:41:59 UTC 1234 Ann Perkins
123abc 2022-01-30 19:41:59 UTC 5678 Bob Builder

How to flatten the json into columns as the example above, using SQL in bigquery? (The script will be running from dbt, but for now, I am just trying to get a valid query to run)

I should add that the actual json has far more fields, they might change, and I expect null values for things like "Middle Name" and the like. So, in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array.

Upvotes: 2

Views: 2947

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

... in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array

For case when your have fields defined dynamically and potentially even different from row to row - i recommend considering below flattening approach

select _airbyte_ab_id, _airbyte_emitted_at, 
  md5(employee) employee_hash,
  json_value(field, "$.id") key,
  regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
unnest(json_extract_array(_airbyte_data, '$.fields')) field       

if applied to sample data in your question - output is

enter image description here

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

How to flatten the json into columns as the example above, using SQL in bigquery?

Consider below approach

select _airbyte_ab_id, _airbyte_emitted_at, 
  json_value(employee, '$.employeeNumber') employeeNumber,
  json_value(employee, '$.firstName') firstName,
  json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee         

if applied to sample data in your question - output is

enter image description here

Upvotes: 1

Related Questions