Reputation: 1437
I'm trying to convert a data frame into nested/hierarchical data that will be written out as JSON lines. The data are structured like this:
import polars as pl

df = pl.DataFrame({
"group_id": ["a", "a", "a", "b", "b", "b"],
"label": ["dog", "cat", "mouse", "dog", "cat", "mouse"],
"indicator": [1, 1, 0, 0, 0, 1]
})
df
┌──────────┬───────┬───────────┐
│ group_id ┆ label ┆ indicator │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════════╪═══════╪═══════════╡
│ a ┆ dog ┆ 1 │
│ a ┆ cat ┆ 1 │
│ a ┆ mouse ┆ 0 │
│ b ┆ dog ┆ 0 │
│ b ┆ cat ┆ 0 │
│ b ┆ mouse ┆ 1 │
└──────────┴───────┴───────────┘
I'm trying to combine the "label" and "indicator" columns into a single dictionary (struct) per "group_id", with the "label" entries as the keys and the "indicator" entries as the values. The result should look like this:
target = pl.DataFrame({
"group_id": ["a", "b"],
"label": [{"dog": 1, "cat": 1, "mouse": 0}, {"dog": 0, "cat": 0, "mouse": 1}],
})
target
┌──────────┬───────────┐
│ group_id ┆ label │
│ --- ┆ --- │
│ str ┆ struct[3] │
╞══════════╪═══════════╡
│ a ┆ {1,1,0} │
│ b ┆ {0,0,1} │
└──────────┴───────────┘
target["label"][0]
{'dog': 1, 'cat': 1, 'mouse': 0}
target.write_ndjson()
'{"group_id":"a","label":{"dog":1,"cat":1,"mouse":0}}\n{"group_id":"b","label":{"dog":0,"cat":0,"mouse":1}}\n'
Upvotes: 1
Views: 1059
Reputation: 21580
Perhaps there is a simpler way, but this looks like a job for .pivot(), followed by packing the pivoted columns into a struct:
(df.pivot(index="group_id", columns="label", values="indicator", aggregate_function=None)
.select("group_id", label=pl.struct(pl.exclude("group_id")))
# .write_ndjson()
)
shape: (2, 2)
┌──────────┬───────────┐
│ group_id ┆ label │
│ --- ┆ --- │
│ str ┆ struct[3] │
╞══════════╪═══════════╡
│ a ┆ {1,1,0} │
│ b ┆ {0,0,1} │
└──────────┴───────────┘
With .write_ndjson() uncommented, the result is:
'{"group_id":"a","label":{"dog":1,"cat":1,"mouse":0}}\n{"group_id":"b","label":{"dog":0,"cat":0,"mouse":1}}\n'
Upvotes: 1