m01010011

Reputation: 1112

How to flatten / unnest / normalize nested JSON column in DuckDB (i.e., split nested fields into separate columns)?

Column types:

D describe some_table;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ BIGINT      │ YES     │         │         │         │
│ data        │ JSON        │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Sample data:

D select id, data from some_table;
┌────────┬──────────────────────────────────────────────────────────────────────────────────┐
│   id   │                                    data                                          │
│ int64  │                                    json                                          │
├────────┼──────────────────────────────────────────────────────────────────────────────────┤
│      1 │ {"type":"abc","purpose":"ad","ts":"...","userId":"","context":{"ip":"x.x.x.x",...│
│      2 │ {"type":"abc","purpose":"search","ts":"...","userId":"ABCD1234","context":{"ip...│
│      3 │ {"type":"defghi","purpose":null,"ts":"...","userId":"","context":{"ip":"x.x.x....│
...
└───────────────────────────────────────────────────────────────────────────────────────────┘

Expectation:

┌──────────┬─────────┬─────┬──────────┬─────────────┬──────────────────┐
│ type     │ purpose │ ts  │ userId   │ context.ip  │ context.sth.else │
├──────────┼─────────┼─────┼──────────┼─────────────┼──────────────────┤
│ abc      │ ad      │ ... │          │             │                  │
│ abc      │ search  │ ... │ ABCD1234 │             │                  │
│ defghi   │ null    │ ... │          │             │                  │
└──────────┴─────────┴─────┴──────────┴─────────────┴──────────────────┘

I tried using unnest but it works only with structs / lists:

D select unnest(data) from some_table limit 5;
Error: Binder Error: UNNEST() can only be applied to lists, structs and NULL
LINE 1: select unnest(data) from some_table limit 5;
            ^

So, I then tried unnest with from_json based on Unnest JSON Array into Rows (pseudo-json_each), but that didn't work either:

D select unnest(from_json(data, '"JSON"')) from some_table limit 5;
Error: Binder Error: UNNEST() can only be applied to lists, structs and NULL
LINE 1: select unnest(from_json(data, '"JSON"')) from ...
            ^

I've also searched for existing solutions, but the ones I came across all seem to assume the data is stored in JSON files, whereas I've read this table from a CSV file. I want to flatten an existing JSON column.

Alternatively, if I could convert this JSON column into a STRUCT column, that would allow me to use the unnest function. But I couldn't find any online references for that, either.

TL;DR - I'm looking for the DuckDB equivalent of pandas.json_normalize.
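For reference, this is roughly the behaviour I'm after (a toy pandas example with made-up records, not my actual data):

import pandas as pd

records = [
    {"type": "abc", "purpose": "ad", "userId": "", "context": {"ip": "x.x.x.x"}},
    {"type": "abc", "purpose": "search", "userId": "ABCD1234", "context": {"ip": "y.y.y.y"}},
]

# nested dicts are flattened into dotted column names:
# type, purpose, userId, context.ip
print(pd.json_normalize(records))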

Upvotes: 7

Views: 3970

Answers (1)

jqurious

Reputation: 21404

As you say, it seems that there is no way to get read_json() behaviour (schema inference) without writing the JSON column out to a tempfile.

Converting to a struct appears to require specifying a schema.

duckdb.sql("from tbl select json_structure(data) limit 1").fetchone()[0]
# '{"type":"VARCHAR","purpose":"VARCHAR","ts":"VARCHAR","userId":"VARCHAR","context":{"ip":"VARCHAR"}}'
schema = duckdb.sql("from tbl select json_structure(data) limit 1").fetchone()[0]

duckdb.sql(f"from tbl select json_transform(data, '{schema}') as data")
# ┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
# │                                              data                                               │
# │ struct("type" varchar, purpose varchar, ts varchar, userid varchar, context struct(ip varchar)) │
# ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
# │ {'type': abc, 'purpose': ad, 'ts': ..., 'userId': , 'context': {'ip': x.x.x.x}}                 │
# │ {'type': abc, 'purpose': search, 'ts': ..., 'userId': ABCD1234, 'context': {'ip': y.y.y.y}}     │
# │ {'type': defghi, 'purpose': NULL, 'ts': ..., 'userId': , 'context': {'ip': z.z.z.z}}            │
# └─────────────────────────────────────────────────────────────────────────────────────────────────┘
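Since unnest() does accept structs, the transformed column should also be unnestable in place, skipping the tempfile entirely. (A sketch reusing the schema string from above - note that json_structure on a single row can miss keys that only appear in later rows.)

duckdb.sql(f"""
from tbl
select
   id,
   unnest(json_transform(data, '{schema}'), recursive := true)
""")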

Example setup using Python API

import duckdb
import io
import tempfile

raw_csv = io.BytesIO(b'''
id,data
1,"{""type"":""abc"",""purpose"":""ad"",""ts"":""..."",""userId"":"""",""context"":{""ip"":""x.x.x.x""}}"
2,"{""type"":""abc"",""purpose"":""search"",""ts"":""..."",""userId"":""ABCD1234"",""context"":{""ip"":""y.y.y.y""}}"
3,"{""type"":""defghi"",""purpose"":null,""ts"":""..."",""userId"":"""",""context"":{""ip"":""z.z.z.z""}}"
'''.strip()
)

tmpfile = tempfile.NamedTemporaryFile()

# the relation is usable as `tbl` in the SQL below via DuckDB's
# Python replacement scans
tbl = duckdb.read_csv(raw_csv)

Tempfile

duckdb.sql(f"copy (from tbl select data::json as data) to '{tmpfile.name}' (format json)")
duckdb.sql(f"from read_json('{tmpfile.name}')")
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                              data                                               │
│ struct("type" varchar, purpose varchar, ts varchar, userid varchar, context struct(ip varchar)) │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'type': abc, 'purpose': ad, 'ts': ..., 'userId': , 'context': {'ip': x.x.x.x}}                 │
│ {'type': abc, 'purpose': search, 'ts': ..., 'userId': ABCD1234, 'context': {'ip': y.y.y.y}}     │
│ {'type': defghi, 'purpose': NULL, 'ts': ..., 'userId': , 'context': {'ip': z.z.z.z}}            │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘

Positional join

duckdb.sql(f"""
from read_json('{tmpfile.name}') j
positional join tbl
select
   tbl.* exclude (data),
   unnest(j.data, recursive := true)
""")
┌───────┬─────────┬─────────┬─────────┬──────────┬─────────┐
│  id   │  type   │ purpose │   ts    │  userId  │   ip    │
│ int64 │ varchar │ varchar │ varchar │ varchar  │ varchar │
├───────┼─────────┼─────────┼─────────┼──────────┼─────────┤
│     1 │ abc     │ ad      │ ...     │          │ x.x.x.x │
│     2 │ abc     │ search  │ ...     │ ABCD1234 │ y.y.y.y │
│     3 │ defghi  │ NULL    │ ...     │          │ z.z.z.z │
└───────┴─────────┴─────────┴─────────┴──────────┴─────────┘

However, I can't find any way to retain the "path" in a recursive unnest - it just uses the final "key" as the column name (e.g. context.ip becomes ip above).

It seems a feature request for this is needed.
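In the meantime, a manual workaround is to extract each field with a JSON path expression and alias it with the full dotted path (a sketch - this gives up the schema inference entirely):

duckdb.sql("""
from tbl
select
   id,
   data::json->>'$.type'       as "type",
   data::json->>'$.purpose'    as "purpose",
   data::json->>'$.ts'         as "ts",
   data::json->>'$.userId'     as "userId",
   data::json->>'$.context.ip' as "context.ip"
""")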

Upvotes: 6
