Column types:
D describe some_table;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ BIGINT      │ YES     │         │         │         │
│ data        │ JSON        │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Sample data:
D select id, data from some_table;
┌────────┬──────────────────────────────────────────────────────────────────────────────────┐
│   id   │                                       data                                       │
│ int64  │                                       json                                       │
├────────┼──────────────────────────────────────────────────────────────────────────────────┤
│      1 │ {"type":"abc","purpose":"ad","ts":"...","userId":"","context":{"ip":"x.x.x.x",...│
│      2 │ {"type":"abc","purpose":"search","ts":"...","userId":"ABCD1234","context":{"ip...│
│      3 │ {"type":"defghi","purpose":null,"ts":"...","userId":"","context":{"ip":"x.x.x....│
...
└────────┴──────────────────────────────────────────────────────────────────────────────────┘
Expectation:
┌──────────┬─────────┬─────┬──────────┬─────────────┬──────────────────┐
│ type     │ purpose │ ts  │ userId   │ context.ip  │ context.sth.else │
├──────────┼─────────┼─────┼──────────┼─────────────┼──────────────────┤
│ abc      │ ad      │ ... │          │             │                  │
│ abc      │ search  │ ... │ ABCD1234 │             │                  │
│ defghi   │ null    │ ... │          │             │                  │
└──────────┴─────────┴─────┴──────────┴─────────────┴──────────────────┘
I tried using unnest, but it works only with structs / lists:
D select unnest(data) from some_table limit 5;
Error: Binder Error: UNNEST() can only be applied to lists, structs and NULL
LINE 1: select unnest(data) from some_table limit 5;
               ^
So I then tried unnest with from_json, based on Unnest JSON Array into Rows (pseudo-json_each), but that didn't work either:
D select unnest(from_json(data, '"JSON"')) from some_table limit 5;
Error: Binder Error: UNNEST() can only be applied to lists, structs and NULL
LINE 1: select unnest(from_json(data, '"JSON"')) from ...
               ^
I've searched for a solution and came across a few related questions, but they all seem to assume the data is stored in JSON files, whereas I've read this table from a CSV file. I want to flatten an existing JSON column.
Alternatively, if I could convert this JSON column into a STRUCT column, that would let me use the unnest function. But I couldn't find any online references for that, either.
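(As a sanity check, unnest does flatten an actual struct; the struct literal below is just an illustration, run via the Python client:)
import duckdb

# unnest on a STRUCT spreads its fields into columns, recursively if asked,
# so a JSON -> STRUCT conversion would give exactly the flattening I want
duckdb.sql("select unnest({'a': 1, 'b': {'c': 2}}, recursive := true)")
# columns: a, c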
TL;DR: I'm looking for the DuckDB equivalent of pandas.json_normalize.
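For reference, this is the pandas behaviour I mean (nested keys become dotted column names):
import pandas as pd

records = [{"type": "abc", "purpose": "ad", "context": {"ip": "x.x.x.x"}}]
# json_normalize flattens nested dicts into dotted columns
print(pd.json_normalize(records).columns.tolist())
# ['type', 'purpose', 'context.ip']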
Upvotes: 7
Views: 3970
As you say, it seems that there is no way to get read_json() behaviour (schema inference) without writing the JSON column out to a tempfile. Converting to a struct appears to require specifying a schema, which json_structure() can infer from a row:
# infer a schema for the JSON column from the first row
schema = duckdb.sql("from tbl select json_structure(data) limit 1").fetchone()[0]
# '{"type":"VARCHAR","purpose":"VARCHAR","ts":"VARCHAR","userId":"VARCHAR","context":{"ip":"VARCHAR"}}'

# cast JSON -> STRUCT using that schema
duckdb.sql(f"from tbl select json_transform(data, '{schema}') as data")
# ┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
# │                                              data                                               │
# │ struct("type" varchar, purpose varchar, ts varchar, userid varchar, context struct(ip varchar)) │
# ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
# │ {'type': abc, 'purpose': ad, 'ts': ..., 'userId': , 'context': {'ip': x.x.x.x}}                 │
# │ {'type': abc, 'purpose': search, 'ts': ..., 'userId': ABCD1234, 'context': {'ip': y.y.y.y}}     │
# │ {'type': defghi, 'purpose': NULL, 'ts': ..., 'userId': , 'context': {'ip': z.z.z.z}}            │
# └─────────────────────────────────────────────────────────────────────────────────────────────────┘
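With the struct in hand, unnest already works without the tempfile detour; a sketch, assuming the first row's json_structure is representative of the whole column (tbl is defined in the full example below):
# flatten via the inferred schema directly (no tempfile), but note the
# schema comes from a single row, so heterogeneous rows would be mis-cast
duckdb.sql(f"""
    from tbl
    select id, unnest(json_transform(data, '{schema}'), recursive := true)
""")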
import duckdb
import io
import tempfile

raw_csv = io.BytesIO(b'''
id,data
1,"{""type"":""abc"",""purpose"":""ad"",""ts"":""..."",""userId"":"""",""context"":{""ip"":""x.x.x.x""}}"
2,"{""type"":""abc"",""purpose"":""search"",""ts"":""..."",""userId"":""ABCD1234"",""context"":{""ip"":""y.y.y.y""}}"
3,"{""type"":""defghi"",""purpose"":null,""ts"":""..."",""userId"":"""",""context"":{""ip"":""z.z.z.z""}}"
'''.strip()
)

tmpfile = tempfile.NamedTemporaryFile()
tbl = duckdb.read_csv(raw_csv)
# round-trip the JSON column through a tempfile so read_json() can infer the schema
duckdb.sql(f"copy (from tbl select data::json as data) to '{tmpfile.name}' (format json)")
duckdb.sql(f"from read_json('{tmpfile.name}')")
┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                              data                                               │
│ struct("type" varchar, purpose varchar, ts varchar, userid varchar, context struct(ip varchar)) │
├─────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'type': abc, 'purpose': ad, 'ts': ..., 'userId': , 'context': {'ip': x.x.x.x}}                 │
│ {'type': abc, 'purpose': search, 'ts': ..., 'userId': ABCD1234, 'context': {'ip': y.y.y.y}}     │
│ {'type': defghi, 'purpose': NULL, 'ts': ..., 'userId': , 'context': {'ip': z.z.z.z}}            │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
# join the inferred structs back onto the original rows and flatten recursively
duckdb.sql(f"""
from read_json('{tmpfile.name}') j
positional join tbl
select
tbl.* exclude (data),
unnest(j.data, recursive := true)
""")
┌───────┬─────────┬─────────┬─────────┬──────────┬─────────┐
│  id   │  type   │ purpose │   ts    │  userId  │   ip    │
│ int64 │ varchar │ varchar │ varchar │ varchar  │ varchar │
├───────┼─────────┼─────────┼─────────┼──────────┼─────────┤
│     1 │ abc     │ ad      │ ...     │          │ x.x.x.x │
│     2 │ abc     │ search  │ ...     │ ABCD1234 │ y.y.y.y │
│     3 │ defghi  │ NULL    │ ...     │          │ z.z.z.z │
└───────┴─────────┴─────────┴─────────┴──────────┴─────────┘
However, I can't find any way to retain the "path" in a recursive unnest - it just uses the final "key" as the column name. It seems a feature request for this is needed.
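In the meantime, a manual workaround sketch: project the nested fields explicitly and alias them yourself (the column list below is hand-written to match the sample schema):
# keep "context.ip"-style names by spelling out each nested path;
# loses the automation of recursive unnest, but preserves the full path
duckdb.sql(f"""
    from read_json('{tmpfile.name}') j
    positional join tbl
    select
        tbl.id,
        j.data.type,
        j.data.purpose,
        j.data.ts,
        j.data."userId",
        j.data.context.ip as "context.ip"
""")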
Upvotes: 6