gopinath kolanchi
gopinath kolanchi

Reputation: 180

Snowflake - Infer Schema from JSON data in Variant Column Dynamically

Experts,

We have a scenario to infer schema from JSON data loaded in the table. It has to be done dynamically and also the JSON data in table would be of different schema.

Example:

row 1-> address <array>[ id string ,name string ]
row 2-> address<array> [addr<object> {id:"1",name:"abc"}]
row 3-> address<array> [addr<object> {id:"2",name:"dfg",Zips<array>[zip1:6009,zip2:789]}]

I am aware we can use LATERAL FLATTEN & recursive to infer the schema. However when we need to shred Zips the above data we need have flatten query as below.

LATERAL FLATTEN (jsondata:address ,recursive =>true) a
LATERAL FLATTEN (a.value:addr,recursive =>true) b -> this is causing issue
LATERAL FLATTEN (c.value:Zips,recursive =>true) c 

When we flatten object data type it is flattening to element level, is there a way to check and dynamically avoid flattening of object.

Regards, Gopi

Upvotes: 0

Views: 3485

Answers (1)

user13472370
user13472370

Reputation:

Snowflake' semi-structured data query features provide data-type inspection functions that can be used to conditionally handle such a varied input within a single table column.

In particular, after breaking down the outer arrays into whole rows, you can use IS_ARRAY, IS_OBJECT, and the : operator (with NULL result checks) functions to separate the record producing logic, and then combine the rows into a single output with a UNION ALL.

The question lacks a clear/usable sample or schema of data (and an expected output) so I've made four assumptions below on what the data looks like and added a filter for each type from the root. The general idea remains the same for each (check type, divide dataset, process each type), you should be able to infer and adjust.

WITH tbl AS (
-- Sample table data

  select parse_json('{"address": [["sa_id1", "sa_name1"], ["sa_id2", "sa_name2"]], "other_outer_field": 1}') jsondata
  union all
  select parse_json('{"address": [{"id": "sr_id1", "name": "sr_name1"}, {"id": "sr_id2", "name": "sr_name2"}], "other_outer_field": 2}') jsondata
  union all
  select parse_json('{"address": [{"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, {"id": "zr_id2", "name": "zr_name2", "zips": ["20001", "20002"]}], "other_outer_field": 3}') jsondata
  union all
  select parse_json('{"address": {"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, "other_outer_field": 4}') jsondata

), all_address_array_formats AS (
-- Table's actual row: { …, "address": [ … ], … } when the address field is an array

  SELECT
    jsondata:other_outer_field AS o_f,
    each_address.value AS address_container
  FROM tbl, LATERAL FLATTEN(jsondata:address) each_address
  WHERE IS_ARRAY(jsondata:address)
  
), all_address_object_formats AS (
-- Table's actual row: { …, "address": { … }, … } when the address field is an object (we do not need to flatten here)

  SELECT
    jsondata:other_outer_field AS o_f,
    jsondata:address AS address_container
  FROM tbl
  WHERE IS_OBJECT(jsondata:address)

), just_array_members AS (
-- For address array with nested arrays: [ [id1, name1], [id2, name2], … ]

  SELECT
    o_f,
    address_container[0]::varchar AS id,
    address_container[1]::varchar AS name,
    NULL AS zipcode
  FROM all_address_array_formats
  WHERE
    IS_ARRAY(address_container)

), simple_record_members AS (
-- For address array with objects, but no zipcode fields: [ { id: id1, name: name1 }, { id: id2, name: name2 }, … ]

  SELECT
    o_f,
    address_container:id::varchar AS id,
    address_container:name::varchar AS name,
    NULL AS zipcode
  FROM all_address_array_formats
  WHERE
    IS_OBJECT(address_container)
    AND address_container:zips IS NULL

), zipcode_record_members AS (
-- For address array with objects, each with multiple zipcodes: [ { id: id1, name: name1, zips: [ zip1_1, zip1_2, … ] }, { id: id2, name: name2, zips: [zip2_1, zip2_2, …] }, … ]

  SELECT
    o_f,
    address_container:id::varchar AS id,
    address_container:name::varchar AS name,
    per_zip.value::varchar AS zipcode
  FROM all_address_array_formats, LATERAL FLATTEN(address_container:zips) per_zip
  WHERE
    IS_OBJECT(address_container)
    AND address_container:zips IS NOT NULL

), zipcodes_within_object AS (
-- For address of object type, a single one with multiple zipcodes: { id: id1, name: name1, zips: [ zip1_1, zip1_2, … ] }

  SELECT
    o_f,
    address_container:id::varchar AS id,
    address_container:name::varchar AS name,
    per_zip.value::varchar AS zipcode
  FROM all_address_object_formats, LATERAL FLATTEN(address_container:zips) per_zip
  WHERE
    IS_OBJECT(address_container)
    AND address_container:zips IS NOT NULL
  
)
SELECT o_f, id, name, zipcode FROM just_array_members UNION ALL
SELECT o_f, id, name, zipcode FROM simple_record_members UNION ALL
SELECT o_f, id, name, zipcode FROM zipcode_record_members UNION ALL
SELECT o_f, id, name, zipcode FROM zipcodes_within_object;

Note: The example also shows how to continue to carry any other fields apart from address from the original object in the table (column: o_f) that are not broken down via the flatten function.

For the input:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                   
| JSONDATA                                                                                                                                                                           |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| {"address": [["sa_id1", "sa_name1"], ["sa_id2", "sa_name2"]], "other_outer_field": 1}                                                                                              |
| {"address": [{"id": "sr_id1", "name": "sr_name1"}, {"id": "sr_id2", "name": "sr_name2"}], "other_outer_field": 2}                                                                  |
| {"address": [{"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, {"id": "zr_id2", "name": "zr_name2", "zips": ["20001", "20002"]}], "other_outer_field": 3} |
| {"address": {"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, "other_outer_field": 4}                                                                     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This produces:

+-----+--------+----------+---------+                                           
| O_F | ID     | NAME     | ZIPCODE |
|-----+--------+----------+---------|
| 1   | sa_id1 | sa_name1 | NULL    |
| 1   | sa_id2 | sa_name2 | NULL    |
| 2   | sr_id1 | sr_name1 | NULL    |
| 2   | sr_id2 | sr_name2 | NULL    |
| 3   | zr_id1 | zr_name1 | 10001   |
| 3   | zr_id1 | zr_name1 | 10002   |
| 3   | zr_id1 | zr_name1 | 10003   |
| 3   | zr_id2 | zr_name2 | 20001   |
| 3   | zr_id2 | zr_name2 | 20002   |
| 4   | zr_id1 | zr_name1 | 10001   |
| 4   | zr_id1 | zr_name1 | 10002   |
| 4   | zr_id1 | zr_name1 | 10003   |
+-----+--------+----------+---------+

Upvotes: 3

Related Questions