Generate schema from json file using jq

Question

I have a newline-delimited JSON file. Is it possible to generate a schema using a tool like jq? I've had some success with jq in the past but haven't done something as complicated as this.

Here's the format of the schema I'm aiming for: https://cloud.google.com/bigquery/docs/nested-repeated#example_schema. Notice that nesting is handled with a fields key of the parent, and arrays are handled with "mode": "repeated". (Any help with some sort of schema is greatly appreciated and I then can massage into this format).

Copying from the link above, I'd like to generate from this:

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}

...to...

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }

]

(ref BigQuery autodetect doesn't work with inconsistent json?, showing that I can't use the BigQuery autodetect because the items aren't the same. I'm fairly confident I can merge schemas together manually to create a superset)

jq170727 · Accepted Answer

Here's a simple recursive function that may help if you decide to roll your own:

def schema:
  def isdate($v):   $v | test("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]");
  def array($k;$v): {"name":$k,"type":"RECORD",mode:"REPEATED","fields":($v[0] | schema)};
  def date($k):     {"name":$k,"type":"DATE",  mode:"NULLABLE"};
  def string($k):   {"name":$k,"type":"STRING",mode:"NULLABLE"};
  def item($k;$v):
     $v | if   type == "array"                 then array($k;$v)
          elif type == "string" and isdate($v) then date($k)
          elif type == "string"                then string($k)
      else empty end;
  [ to_entries[] | item(.key;.value) ]
;
schema

Try it online!

Generate schema from json file using jq

Answers (2)

Related Questions