NathanC

Reputation: 193

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:

{
  "key_a": "value_a",
  "key_b": {
    "key_c": "value_c",
    "key_d": "value_d"
  },
  "key_e": {
    "key_f": "value_f",
    "key_g": "value_g"
  }
}

The keys under key_e are dynamic, i.e. in one response key_e will contain key_f and key_g, while in another response it will instead contain key_h and key_i. New keys can appear at any time, so I cannot define a record with nullable fields for every possible key.

Instead, I want to create a column with a JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e into a column with STRING datatype, but its value is parsed as JSON and the load fails.
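For illustration, this is roughly the query I am aiming for once key_e is stored as a single string column (mydataset.responses is a placeholder name):

-- Hoped-for usage: pull individual fields out of the stored JSON string.
SELECT
  key_a,
  JSON_EXTRACT(key_e, '$.key_f') AS value_f
FROM mydataset.responses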

How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?

Upvotes: 8

Views: 18303

Answers (3)

Mikhail Berlyant

Reputation: 172994

Having your JSON as a single string column inside BigQuery is definitely an option. However, with a large volume of data this can lead to a high query price, since all your data ends up in one column, and the querying logic can become quite messy.

If you have the luxury of slightly changing your "design", I would recommend considering the one below, where you can employ REPEATED mode:

Table schema:

[
  { "name": "key_a",
    "type": "STRING" },
  { "name": "key_b",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key",
        "type": "STRING"},
      { "name": "value",
        "type": "STRING"}
    ]
  },
  { "name": "key_e",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key",
        "type": "STRING"},
      { "name": "value",
        "type": "STRING"}
    ]
  }
]

Example of JSON to load:

{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}

Please note: this must be newline-delimited JSON, so each record has to be on a single line.
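With the schema above saved to a file, the load could look like this (a minimal sketch; mydataset.mytable, data.json, and schema.json are placeholder names):

# Sketch: load the newline-delimited JSON using the schema above.
# mydataset.mytable, data.json, and schema.json are placeholder names.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable data.json schema.json

Once loaded, a value for a given key can be pulled out of the repeated key/value pairs, for example (a sketch in today's Standard SQL):

-- Sketch: look up the value stored under key 'key_f' in the repeated field.
SELECT
  key_a,
  (SELECT value FROM UNNEST(key_e) WHERE key = 'key_f') AS value_f
FROM mydataset.mytable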

Upvotes: 11

Adam Lydick

Reputation: 1122

You can't do this directly with BigQuery, but you can make it work in two passes:

(1) Import your JSON data as a CSV file with a single string column.

(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.

Example

I'll start with some JSON:

{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}

Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'þ') or it tripped over the default ','.
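For reference, that load command might look like this (a sketch; data.json is a placeholder file name, and JsonAnyKey.raw matches the table queried below):

# Sketch: load each JSON line as a single STRING column named 'blob',
# using thorn as a delimiter that never appears in the data.
bq load --source_format=CSV --field_delimiter='þ' \
    JsonAnyKey.raw data.json blob:STRING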

Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.

Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:

a (INTEGER)
b (STRING)
c (STRING -- packed JSON)

We can do this with a UDF:

// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
  'extractAndRepack',                // Name of the function exported to SQL
  ['blob'],                          // Names of input columns
  [{'name': 'a', 'type': 'integer'}, // Output schema
   {'name': 'b', 'type': 'string'},
   {'name': 'c', 'type': 'string'}],
  function (row, emit) {
    var parsed = JSON.parse(row.blob);
    var repacked = JSON.stringify(parsed.c);
    emit({a: parsed.a, b: parsed.b, c: repacked});
  }
);

And a corresponding query:

SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)

Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.

Row  a  b      c
1    0  zero   {"woodchuck":"a"}
2    1  one    {"chipmunk":"b"}
3    2  two    {"squirrel":"c"}
4    3  three  {"chinchilla":"d"}
5    4  four   {"capybara":"e"}
6    5  five   {"housemouse":"f"}
7    6  six    {"molerat":"g"}
8    7  seven  {"marmot":"h"}
9    8  eight  {"badger":"i"}
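Once c holds the packed JSON string, it can be queried the way the original question intends, for example (a sketch; JsonAnyKey.repacked is a placeholder name for the destination table):

-- Sketch: pull a field back out of the packed JSON string column.
SELECT a, b, JSON_EXTRACT_SCALAR(c, '$.woodchuck') AS woodchuck
FROM JsonAnyKey.repacked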

Upvotes: 5

Mosha Pasumansky

Reputation: 14004

One way to do it is to load this file as CSV instead of JSON (and quote the values, or eliminate the newlines in the middle); it will then become a single STRING column inside BigQuery.
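The raw lines can then be queried directly with the JSON functions, for example (a sketch; mydataset.raw_lines and its column line are placeholder names):

-- Sketch: extract fields straight from the raw JSON line.
SELECT
  JSON_EXTRACT_SCALAR(line, '$.key_a') AS key_a,
  JSON_EXTRACT(line, '$.key_e') AS key_e_json
FROM mydataset.raw_lines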

P.S. You are right that having a native JSON data type would have made this scenario much more natural, and the BigQuery team is well aware of it.

Upvotes: 3
