Reputation: 110462
I have various highly nested json objects. I am wondering whether to store these as STRUCT
s in BigQuery or as a STRING
. If storing it as a string, then I can use JSON_EXTRACT
where necessary to get what I need. I have a few questions on using the following approach:
STRUCT
instead of a string give?Finally, I wasn't able to find any place in the documentation that gives examples of how to query STRUCT
s. The only place I could find was https://cloud.google.com/bigquery/docs/nested-repeated. Are there examples in the documentation (or elsewhere) on querying nested fields? Additionally, why is the term RECORD
and STRUCT
used interchangeably on this page?
Note that the json will not be repeated at the root level, i.e., it will look like {...}
and not [{...},{...}]
.
As a reference, in Redshift you would (as of this question) store json as a string and use the json-functions to manipulate it: https://stackoverflow.com/a/32731374/651174.
Upvotes: 10
Views: 9642
Reputation: 173171
I have various highly nested json objects
Unfortunately (as of this answer) BigQuery's JSON_EXTRACT has some quite limitations that does not allow processing JSON arrays
Obviously, there is workaround involving using JS UDF:
To overcome BigQuery "limitation" for JsonPath, you can use JS UDF as below example shows:
It uses jsonpath-0.8.0.js that can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage - gs://your_bucket/jsonpath-0.8.0.js
#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
try { var parsed = JSON.parse(json);
return JSON.stringify(jsonPath(parsed, json_path));
} catch (e) { return null }
"""
OPTIONS (
library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
SELECT '''
{ "store": {
"book": [
{ "category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{ "category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{ "category": "fiction",
"author": "Herman Melville",
"title": "Moby Dick",
"isbn": "0-553-21311-3",
"price": 8.99
},
{ "category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
}
}
''' AS x
)
SELECT
CUSTOM_JSON_EXTRACT(x, '$.store.book[*].author'),
CUSTOM_JSON_EXTRACT(x, '$..*[?(@.price==22.99)].author'),
CUSTOM_JSON_EXTRACT(x, '$..author'),
CUSTOM_JSON_EXTRACT(x, '$.store.*'),
CUSTOM_JSON_EXTRACT(x, '$.store..price'),
CUSTOM_JSON_EXTRACT(x, '$..book[(@.length-1)]'),
CUSTOM_JSON_EXTRACT(x, '$..book[-1:]'),
CUSTOM_JSON_EXTRACT(x, '$..book[0,1]'),
CUSTOM_JSON_EXTRACT(x, '$..book[:2]'),
CUSTOM_JSON_EXTRACT(x, '$..book[?(@.isbn)]')
FROM t
Result is as below
For CUSTOM_JSON_EXTRACT(x, '$.store.book[*].author')
[
"Nigel Rees"
"Evelyn Waugh"
"Herman Melville"
"J. R. R. Tolkien"
]
For CUSTOM_JSON_EXTRACT(x, '$..*[?(@.price==22.99)].author')
[
"J. R. R. Tolkien"
]
For CUSTOM_JSON_EXTRACT(x, '$.store..price')
[
8.95
12.99
8.99
22.99
19.95
]
Would it be a bad idea storing json data as a string instead of record?
So, based on above - I would say - use STRING when your JSON is very simple (which looks like not your case) otherwise store as STRUCTs, so yo will not need to do extractions each and every time using "expensive" JS UDF
Would there be a big performance hit whenever using that json field if it's stored as a string?
If you will need to use expensive JS UDF - YES
While if your JSON is simple enough and can be processed with JSON_EXTRACT - impact should not be visible
Upvotes: 3
Reputation: 59295
I usually do both:
My 3 steps:
Then, in case of schema change:
Upvotes: 23