Reputation: 8490
I have a newline-delimited JSON file. Is it possible to generate a schema using a tool like jq
? I've had some success with jq
in the past but haven't done something as complicated as this.
Here's the format of the schema I'm aiming for: https://cloud.google.com/bigquery/docs/nested-repeated#example_schema. Notice that nesting is handled with a fields
key of the parent, and arrays are handled with "mode": "repeated"
. (Any help with some sort of schema is greatly appreciated and I then can massage into this format).
Copying from the link above, I'd like to generate from this:
{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
...to...
[
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "first_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "last_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "dob",
"type": "DATE",
"mode": "NULLABLE"
},
{
"name": "addresses",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "status",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "address",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "city",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "state",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "zip",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "numberOfYears",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
(ref BigQuery autodetect doesn't work with inconsistent json?, showing that I can't use the BigQuery autodetect because the items aren't the same. I'm fairly confident I can merge schemas together manually to create a superset)
Upvotes: 3
Views: 3628
Reputation: 116780
Any help with some sort of schema is greatly appreciated and I then can massage into this format
There is a schema-inference module written in jq at http://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed but the inferred schemas are "structural" - they mirror the input JSON. For your sample, the inferred schema is as shown below. As you can see, it would be quite easy to transform this into the format you have in mind, except that extra work would be required to infer the mode
values.
Please note that the above-mentioned module infers the "common schema" from an arbitrarily large "sample" of JSON documents. That is, it is a schema inference engine rather than simply a "schema generator".
The above link references a companion schema-checker named JESS, also written in jq. The "E" in "JESS" stands for "extended", signifying that the JESS schema language for specifying schemas allows complex constraints to be included.
{
"id": "string",
"first_name": "string",
"last_name": "string",
"dob": "string",
"addresses": [
{
"status": "string",
"address": "string",
"city": "string",
"state": "string",
"zip": "string",
"numberOfYears": "string"
}
]
}
Upvotes: 1
Reputation: 14663
Here's a simple recursive function that may help if you decide to roll your own:
def schema:
def isdate($v): $v | test("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]");
def array($k;$v): {"name":$k,"type":"RECORD",mode:"REPEATED","fields":($v[0] | schema)};
def date($k): {"name":$k,"type":"DATE", mode:"NULLABLE"};
def string($k): {"name":$k,"type":"STRING",mode:"NULLABLE"};
def item($k;$v):
$v | if type == "array" then array($k;$v)
elif type == "string" and isdate($v) then date($k)
elif type == "string" then string($k)
else empty end;
[ to_entries[] | item(.key;.value) ]
;
schema
Upvotes: 2