Reputation: 180
Experts,
We have a scenario to infer schema from JSON data loaded in the table. It has to be done dynamically and also the JSON data in table would be of different schema.
Example:
row 1-> address <array>[ id string ,name string ]
row 2-> address<array> [addr<object> {id:"1",name:"abc"}]
row 3-> address<array> [addr<object> {id:"2",name:"dfg",Zips<array>[zip1:6009,zip2:789]}]
I am aware we can use LATERAL FLATTEN & recursive to infer the schema. However when we need to shred Zips the above data we need have flatten query as below.
LATERAL FLATTEN (jsondata:address ,recursive =>true) a
LATERAL FLATTEN (a.value:addr,recursive =>true) b -> this is causing issue
LATERAL FLATTEN (c.value:Zips,recursive =>true) c
When we flatten object data type it is flattening to element level, is there a way to check and dynamically avoid flattening of object.
Regards, Gopi
Upvotes: 0
Views: 3485
Reputation:
Snowflake' semi-structured data query features provide data-type inspection functions that can be used to conditionally handle such a varied input within a single table column.
In particular, after breaking down the outer arrays into whole rows, you can use IS_ARRAY
, IS_OBJECT
, and the :
operator (with NULL
result checks) functions to separate the record producing logic, and then combine the rows into a single output with a UNION ALL
.
The question lacks a clear/usable sample or schema of data (and an expected output) so I've made four assumptions below on what the data looks like and added a filter for each type from the root. The general idea remains the same for each (check type, divide dataset, process each type), you should be able to infer and adjust.
WITH tbl AS (
-- Sample table data
select parse_json('{"address": [["sa_id1", "sa_name1"], ["sa_id2", "sa_name2"]], "other_outer_field": 1}') jsondata
union all
select parse_json('{"address": [{"id": "sr_id1", "name": "sr_name1"}, {"id": "sr_id2", "name": "sr_name2"}], "other_outer_field": 2}') jsondata
union all
select parse_json('{"address": [{"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, {"id": "zr_id2", "name": "zr_name2", "zips": ["20001", "20002"]}], "other_outer_field": 3}') jsondata
union all
select parse_json('{"address": {"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, "other_outer_field": 4}') jsondata
), all_address_array_formats AS (
-- Table's actual row: { …, "address": [ … ], … } when the address field is an array
SELECT
jsondata:other_outer_field AS o_f,
each_address.value AS address_container
FROM tbl, LATERAL FLATTEN(jsondata:address) each_address
WHERE IS_ARRAY(jsondata:address)
), all_address_object_formats AS (
-- Table's actual row: { …, "address": { … }, … } when the address field is an object (we do not need to flatten here)
SELECT
jsondata:other_outer_field AS o_f,
jsondata:address AS address_container
FROM tbl
WHERE IS_OBJECT(jsondata:address)
), just_array_members AS (
-- For address array with nested arrays: [ [id1, name1], [id2, name2], … ]
SELECT
o_f,
address_container[0]::varchar AS id,
address_container[1]::varchar AS name,
NULL AS zipcode
FROM all_address_array_formats
WHERE
IS_ARRAY(address_container)
), simple_record_members AS (
-- For address array with objects, but no zipcode fields: [ { id: id1, name: name1 }, { id: id2, name: name2 }, … ]
SELECT
o_f,
address_container:id::varchar AS id,
address_container:name::varchar AS name,
NULL AS zipcode
FROM all_address_array_formats
WHERE
IS_OBJECT(address_container)
AND address_container:zips IS NULL
), zipcode_record_members AS (
-- For address array with objects, each with multiple zipcodes: [ { id: id1, name: name1, zips: [ zip1_1, zip1_2, … ] }, { id: id2, name: name2, zips: [zip2_1, zip2_2, …] }, … ]
SELECT
o_f,
address_container:id::varchar AS id,
address_container:name::varchar AS name,
per_zip.value::varchar AS zipcode
FROM all_address_array_formats, LATERAL FLATTEN(address_container:zips) per_zip
WHERE
IS_OBJECT(address_container)
AND address_container:zips IS NOT NULL
), zipcodes_within_object AS (
-- For address of object type, a single one with multiple zipcodes: { id: id1, name: name1, zips: [ zip1_1, zip1_2, … ] }
SELECT
o_f,
address_container:id::varchar AS id,
address_container:name::varchar AS name,
per_zip.value::varchar AS zipcode
FROM all_address_object_formats, LATERAL FLATTEN(address_container:zips) per_zip
WHERE
IS_OBJECT(address_container)
AND address_container:zips IS NOT NULL
)
SELECT o_f, id, name, zipcode FROM just_array_members UNION ALL
SELECT o_f, id, name, zipcode FROM simple_record_members UNION ALL
SELECT o_f, id, name, zipcode FROM zipcode_record_members UNION ALL
SELECT o_f, id, name, zipcode FROM zipcodes_within_object;
Note: The example also shows how to continue to carry any other fields apart from address
from the original object in the table (column: o_f
) that are not broken down via the flatten function.
For the input:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| JSONDATA |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| {"address": [["sa_id1", "sa_name1"], ["sa_id2", "sa_name2"]], "other_outer_field": 1} |
| {"address": [{"id": "sr_id1", "name": "sr_name1"}, {"id": "sr_id2", "name": "sr_name2"}], "other_outer_field": 2} |
| {"address": [{"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, {"id": "zr_id2", "name": "zr_name2", "zips": ["20001", "20002"]}], "other_outer_field": 3} |
| {"address": {"id": "zr_id1", "name": "zr_name1", "zips": ["10001", "10002", "10003"]}, "other_outer_field": 4} |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This produces:
+-----+--------+----------+---------+
| O_F | ID | NAME | ZIPCODE |
|-----+--------+----------+---------|
| 1 | sa_id1 | sa_name1 | NULL |
| 1 | sa_id2 | sa_name2 | NULL |
| 2 | sr_id1 | sr_name1 | NULL |
| 2 | sr_id2 | sr_name2 | NULL |
| 3 | zr_id1 | zr_name1 | 10001 |
| 3 | zr_id1 | zr_name1 | 10002 |
| 3 | zr_id1 | zr_name1 | 10003 |
| 3 | zr_id2 | zr_name2 | 20001 |
| 3 | zr_id2 | zr_name2 | 20002 |
| 4 | zr_id1 | zr_name1 | 10001 |
| 4 | zr_id1 | zr_name1 | 10002 |
| 4 | zr_id1 | zr_name1 | 10003 |
+-----+--------+----------+---------+
Upvotes: 3