Reputation: 110462

Storing JSON in BigQuery

I have various highly nested json objects. I am wondering whether to store these as STRUCTs in BigQuery or as a STRING. If storing it as a string, then I can use JSON_EXTRACT where necessary to get what I need. I have a few questions on using the following approach:

Would it be a bad idea storing json data as a string instead of record?
Would there be a big performance hit whenever using that json field if it's stored as a string?
What additional advantages would storing the json as a STRUCT instead of a string give?

Finally, I wasn't able to find any place in the documentation that gives examples of how to query STRUCTs. The only place I could find was https://cloud.google.com/bigquery/docs/nested-repeated. Are there examples in the documentation (or elsewhere) on querying nested fields? Additionally, why is the term RECORD and STRUCT used interchangeably on this page?

Note that the json will not be repeated at the root level, i.e., it will look like {...} and not [{...},{...}].

As a reference, in Redshift you would (as of this question) store json as a string and use the json-functions to manipulate it: https://stackoverflow.com/a/32731374/651174.

Upvotes: 10

Answers (2)

Mikhail Berlyant

Reputation: 173171

I have various highly nested json objects

Unfortunately (as of this answer) BigQuery's JSON_EXTRACT has some quite limitations that does not allow processing JSON arrays

Obviously, there is workaround involving using JS UDF:

To overcome BigQuery "limitation" for JsonPath, you can use JS UDF as below example shows:
It uses jsonpath-0.8.0.js that can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage - gs://your_bucket/jsonpath-0.8.0.js

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
    try { var parsed = JSON.parse(json);
        return JSON.stringify(jsonPath(parsed, json_path));
    } catch (e) { return null }
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
SELECT '''
{ "store": {
        "book": [ 
            { "category": "reference",
                "author": "Nigel Rees",
                "title": "Sayings of the Century",
                "price": 8.95
            },
            { "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "Sword of Honour",
                "price": 12.99
            },
            { "category": "fiction",
                "author": "Herman Melville",
                "title": "Moby Dick",
                "isbn": "0-553-21311-3",
                "price": 8.99
            },
            { "category": "fiction",
                "author": "J. R. R. Tolkien",
                "title": "The Lord of the Rings",
                "isbn": "0-395-19395-8",
                "price": 22.99
            }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    }
}
''' AS x
)
SELECT 
    CUSTOM_JSON_EXTRACT(x, '$.store.book[*].author'),
    CUSTOM_JSON_EXTRACT(x, '$..*[?(@.price==22.99)].author'),
    CUSTOM_JSON_EXTRACT(x, '$..author'),
    CUSTOM_JSON_EXTRACT(x, '$.store.*'),
    CUSTOM_JSON_EXTRACT(x, '$.store..price'),
    CUSTOM_JSON_EXTRACT(x, '$..book[(@.length-1)]'),
    CUSTOM_JSON_EXTRACT(x, '$..book[-1:]'),
    CUSTOM_JSON_EXTRACT(x, '$..book[0,1]'),
    CUSTOM_JSON_EXTRACT(x, '$..book[:2]'),
    CUSTOM_JSON_EXTRACT(x, '$..book[?(@.isbn)]')
FROM t

Result is as below

For CUSTOM_JSON_EXTRACT(x, '$.store.book[*].author')

[
  "Nigel Rees"
  "Evelyn Waugh"
  "Herman Melville"
  "J. R. R. Tolkien"
]

For CUSTOM_JSON_EXTRACT(x, '$..*[?(@.price==22.99)].author')

[
  "J. R. R. Tolkien"
]

For CUSTOM_JSON_EXTRACT(x, '$.store..price')

Would it be a bad idea storing json data as a string instead of record?

So, based on above - I would say - use STRING when your JSON is very simple (which looks like not your case) otherwise store as STRUCTs, so yo will not need to do extractions each and every time using "expensive" JS UDF

Would there be a big performance hit whenever using that json field if it's stored as a string?

If you will need to use expensive JS UDF - YES
While if your JSON is simple enough and can be processed with JSON_EXTRACT - impact should not be visible

Upvotes: 3

Felipe Hoffa

Reputation: 59295

I usually do both:

Store JSON objects as STRINGs for posterity and re-factorings.
Materialize easy-to-query tables from your JSON objects - to get you and your team a better experience when querying.

My 3 steps:

Store everything as JSON strings. Then you won't lose data in case of schema changes, for example.
Create a VIEW that JSON_EXTRACTs data into easy to query columns.
Materialize those views into tables for the best performance and ease.

Then, in case of schema change:

Everything you have stored, stays the same.
You can modify the views to suit the new schema.
You can re-materialize tables into the new schema.

Upvotes: 23

Storing JSON in BigQuery

Answers (2)

Related Questions