Maximilian
Maximilian

Reputation: 8490

Generate schema from json file using jq

I have a newline-delimited JSON file. Is it possible to generate a schema using a tool like jq? I've had some success with jq in the past but haven't done something as complicated as this.

Here's the format of the schema I'm aiming for: https://cloud.google.com/bigquery/docs/nested-repeated#example_schema. Notice that nesting is handled with a fields key of the parent, and arrays are handled with "mode": "repeated". (Any help with some sort of schema is greatly appreciated and I then can massage into this format).

Copying from the link above, I'd like to generate from this:

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}

...to...

[
    {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "first_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "last_name",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "dob",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "addresses",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {
                "name": "status",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "address",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "city",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "state",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "zip",
                "type": "STRING",
                "mode": "NULLABLE"
            },
            {
                "name": "numberOfYears",
                "type": "STRING",
                "mode": "NULLABLE"
            }
        ]
    }

]

(ref BigQuery autodetect doesn't work with inconsistent json?, showing that I can't use the BigQuery autodetect because the items aren't the same. I'm fairly confident I can merge schemas together manually to create a superset)

Upvotes: 3

Views: 3628

Answers (2)

peak
peak

Reputation: 116780

Any help with some sort of schema is greatly appreciated and I then can massage into this format

There is a schema-inference module written in jq at http://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed but the inferred schemas are "structural" - they mirror the input JSON. For your sample, the inferred schema is as shown below. As you can see, it would be quite easy to transform this into the format you have in mind, except that extra work would be required to infer the mode values.

Please note that the above-mentioned module infers the "common schema" from an arbitrarily large "sample" of JSON documents. That is, it is a schema inference engine rather than simply a "schema generator".

The above link references a companion schema-checker named JESS, also written in jq. The "E" in "JESS" stands for "extended", signifying that the JESS schema language for specifying schemas allows complex constraints to be included.

{
  "id": "string",
  "first_name": "string",
  "last_name": "string",
  "dob": "string",
  "addresses": [
    {
      "status": "string",
      "address": "string",
      "city": "string",
      "state": "string",
      "zip": "string",
      "numberOfYears": "string"
    }
  ]
}

Upvotes: 1

jq170727
jq170727

Reputation: 14663

Here's a simple recursive function that may help if you decide to roll your own:

def schema:
  def isdate($v):   $v | test("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]");
  def array($k;$v): {"name":$k,"type":"RECORD",mode:"REPEATED","fields":($v[0] | schema)};
  def date($k):     {"name":$k,"type":"DATE",  mode:"NULLABLE"};
  def string($k):   {"name":$k,"type":"STRING",mode:"NULLABLE"};
  def item($k;$v):
     $v | if   type == "array"                 then array($k;$v)
          elif type == "string" and isdate($v) then date($k)
          elif type == "string"                then string($k)
      else empty end;
  [ to_entries[] | item(.key;.value) ]
;
schema

Try it online!

Upvotes: 2

Related Questions