Reputation: 3166
I have a dataframe with a MultiIndex.
            01.01  02.01  03.01  04.01
bar total1     40     52     18     11
    total2     36     85      5     92
baz total1     23     39     45     70
    total2     50     49     51     65
foo total1     23     97     17     97
    total2     64     56     94     45
qux total1     13     73     38      4
    total2     80      8     61     50
df.index.values
results in:
array([('bar', 'total1'), ('bar', 'total2'), ('baz', 'total1'),
       ('baz', 'total2'), ('foo', 'total1'), ('foo', 'total2'),
       ('qux', 'total1'), ('qux', 'total2')], dtype=object)
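(For anyone who wants to follow along, the example frame can be rebuilt like this; the values are simply copied from the table printed above.)

```python
import pandas as pd

# Rebuild the example dataframe from the question.
idx = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["total1", "total2"]]
)
data = [
    [40, 52, 18, 11], [36, 85, 5, 92],
    [23, 39, 45, 70], [50, 49, 51, 65],
    [23, 97, 17, 97], [64, 56, 94, 45],
    [13, 73, 38, 4], [80, 8, 61, 50],
]
df = pd.DataFrame(data, index=idx, columns=["01.01", "02.01", "03.01", "04.01"])
```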
I am ultimately looking to transform the dataframe into a dictionary of dictionaries: the outer keys are the first index level (one of ['bar', 'baz', 'foo', 'qux']), their values map each date to an inner dictionary, and the inner dictionaries use 'total1' and 'total2' as keys with the integers of the df as values.
Put another way: if dict1
is the resulting dict, then the entry for 'bar'
would look like:
{'bar': {'01.01': {'total1': 40, 'total2': 36},
         '02.01': {'total1': 52, 'total2': 85},
         '03.01': {'total1': 18, 'total2': 5},
         '04.01': {'total1': 11, 'total2': 92}}}
How and what would I need to alter in order to achieve this? Is this an indexing issue?
Upvotes: 33
Views: 23608
Reputation: 23071
piRSquared's answer is great if the number of groups is small, but if there are many groups it is very slow, because groupby.apply
has to call a Python function once per group.1
If there are many groups, a faster way to convert a dataframe into a nested dictionary is to simply loop over it using itertuples
and construct the dictionary on the fly. Because we need a nested dictionary, dict.setdefault()
or collections.defaultdict
will be helpful.
dict2 = {}
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        dict2.setdefault(idx1, {}).setdefault(col, {})[idx2] = val
dict.setdefault()
here inserts a key with an empty dictionary as its value whenever the key doesn't exist yet. It produces the same result as the following, more verbose code.
dict2 = {}
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        if idx1 in dict2:
            if col in dict2[idx1]:
                dict2[idx1][col][idx2] = val
            else:
                dict2[idx1][col] = {idx2: val}
        else:
            dict2[idx1] = {col: {idx2: val}}
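The collections.defaultdict route mentioned above looks like this. (A sketch: dict3 is a made-up name, and the final comprehension only exists to turn the defaultdicts back into plain dicts for a clean repr; a small frame with the question's index shape stands in for the real data.)

```python
from collections import defaultdict

import pandas as pd

# Small frame with the same two-level index shape as the question.
idx = pd.MultiIndex.from_tuples(
    [("bar", "total1"), ("bar", "total2"), ("baz", "total1"), ("baz", "total2")]
)
df = pd.DataFrame(
    {"01.01": [40, 36, 23, 50], "02.01": [52, 85, 39, 49]}, index=idx
)

# Missing keys spawn empty dicts automatically, so no setdefault calls needed.
nested = defaultdict(lambda: defaultdict(dict))
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        nested[idx1][col][idx2] = val

# Convert the outer levels back to plain dicts.
dict3 = {k: dict(v) for k, v in nested.items()}
```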
1 Benchmark to support the claim made above (tested on Python 3.11.5, pandas 2.2.0):
idx = pd.MultiIndex.from_product((range(1000), range(10)))
df = pd.DataFrame({c: range(10000) for c in 'ABCD'}, index=idx)
%%timeit
dict1 = df.groupby(level=0).apply(lambda df: df.xs(df.name).to_dict()).to_dict()
# 849 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dict2 = {}
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        dict2.setdefault(idx1, {}).setdefault(col, {})[idx2] = val
# 50.3 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As the benchmark above shows, with 1000 groups the straightforward loop is about 16 times faster than the groupby.apply
solution, and the gap widens as the number of groups increases.
On the other hand, if there are few groups (<10) then both approaches are very fast so performance probably wouldn’t be an issue anymore.
More benchmarks on this point can be found at Convert Pandas Dataframe to nested JSON.
Upvotes: 2
Reputation: 294218
To convert the whole dataframe to a dictionary, try:
df.groupby(level=0).apply(lambda df: df.xs(df.name).to_dict()).to_dict()
{'bar': {'01.01': {'total1': 40, 'total2': 36},
         '02.01': {'total1': 52, 'total2': 85},
         '03.01': {'total1': 18, 'total2': 5},
         '04.01': {'total1': 11, 'total2': 92}},
 'baz': {'01.01': {'total1': 23, 'total2': 50},
         '02.01': {'total1': 39, 'total2': 49},
         '03.01': {'total1': 45, 'total2': 51},
         '04.01': {'total1': 70, 'total2': 65}},
 'foo': {'01.01': {'total1': 23, 'total2': 64},
         '02.01': {'total1': 97, 'total2': 56},
         '03.01': {'total1': 17, 'total2': 94},
         '04.01': {'total1': 97, 'total2': 45}},
 'qux': {'01.01': {'total1': 13, 'total2': 80},
         '02.01': {'total1': 73, 'total2': 8},
         '03.01': {'total1': 38, 'total2': 61},
         '04.01': {'total1': 4, 'total2': 50}}}
To convert one particular column, select it before converting to a dictionary, i.e.
df.groupby(level=0).apply(lambda df: df.xs(df.name)[colname].to_dict()).to_dict()
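A quick end-to-end check of both one-liners, on a small frame of the same shape ("01.01" stands in for colname here; the frame itself is illustrative, not the asker's data):

```python
import pandas as pd

# Small frame with the question's index/column shape.
idx = pd.MultiIndex.from_tuples(
    [("bar", "total1"), ("bar", "total2"), ("baz", "total1"), ("baz", "total2")]
)
df = pd.DataFrame(
    {"01.01": [40, 36, 23, 50], "02.01": [52, 85, 39, 49]}, index=idx
)

# Whole frame -> {outer level: {date: {inner level: value}}}.
full = df.groupby(level=0).apply(lambda d: d.xs(d.name).to_dict()).to_dict()

# One column -> {outer level: {inner level: value}}.
one_col = (
    df.groupby(level=0).apply(lambda d: d.xs(d.name)["01.01"].to_dict()).to_dict()
)
```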
Upvotes: 38