OAK

Reputation: 3166

DataFrame with MultiIndex to nested dictionary

I have a dataframe with a MultiIndex.

             01.01  02.01  03.01  04.01
bar total1     40     52     18     11
    total2     36     85      5     92
baz total1     23     39     45     70
    total2     50     49     51     65
foo total1     23     97     17     97
    total2     64     56     94     45
qux total1     13     73     38      4
    total2     80      8     61     50

df.index.values results in:

array([('bar', 'total1'), ('bar', 'total2'), ('baz', 'total1'),
       ('baz', 'total2'), ('foo', 'total1'), ('foo', 'total2'),
       ('qux', 'total1'), ('qux', 'total2')], dtype=object)
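For reference, the frame above can be reproduced with something along these lines (a reconstruction from the values shown, not my original code):

import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['total1', 'total2']])
df = pd.DataFrame(
    [[40, 52, 18, 11], [36, 85, 5, 92],
     [23, 39, 45, 70], [50, 49, 51, 65],
     [23, 97, 17, 97], [64, 56, 94, 45],
     [13, 73, 38, 4], [80, 8, 61, 50]],
    index=idx,
    columns=['01.01', '02.01', '03.01', '04.01'],
)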

I am ultimately looking to transform the dataframe into a dictionary of dictionaries, such that the keys of the outer dict are one of ['bar', 'baz', 'foo', 'qux'], the keys of the next level are the dates, and the innermost dictionary has 'total1' and 'total2' as keys and the integers of the df as values. In other words, if dict1 is the dict, the entry for 'bar' would look like:

{'bar': {'01.01': {'total1': 40, 'total2': 36},
         '02.01': {'total1': 52, 'total2': 85},
         '03.01': {'total1': 18, 'total2': 5},
         '04.01': {'total1': 11, 'total2': 92}}}

How and what would I need to alter in order to achieve this? Is this an indexing issue?

Upvotes: 33

Views: 23608

Answers (2)

cottontail

Reputation: 23071

piRSquared's answer is great if the number of groups is small, but if there are many groups, it becomes very slow because groupby.apply has to call a Python function once per group.¹

If there are many groups, a faster way to convert a dataframe into a nested dictionary is to simply loop over the dataframe using itertuples and construct a dictionary on the fly. Because we need a nested dictionary, dict.setdefault() or collections.defaultdict will be helpful.

dict2 = {}
# build {level0: {column: {level1: value}}} while iterating over the rows
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        dict2.setdefault(idx1, {}).setdefault(col, {})[idx2] = val

dict.setdefault() inserts a key with an empty dictionary as its value if the key doesn't already exist. The loop produces the same result as the following more verbose code.

dict2 = {}
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        if idx1 in dict2:
            if col in dict2[idx1]:
                dict2[idx1][col][idx2] = val
            else:
                dict2[idx1][col] = {idx2: val}
        else:
            dict2[idx1] = {col: {idx2: val}}
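Since collections.defaultdict was mentioned above, the same loop can also be written with a nested defaultdict, which creates the missing inner dictionaries automatically (a sketch, equivalent to the setdefault version):

from collections import defaultdict

# missing inner dicts are created on first access
dict3 = defaultdict(lambda: defaultdict(dict))
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        dict3[idx1][col][idx2] = val

# with the example frame: dict3['bar']['01.01'] -> {'total1': 40, 'total2': 36}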

¹ Benchmark to support the claim made above (tested on Python 3.11.5, pandas 2.2.0):

idx = pd.MultiIndex.from_product((range(1000), range(10)))
df = pd.DataFrame({c: range(10000) for c in 'ABCD'}, index=idx)

%%timeit
dict1 = df.groupby(level=0).apply(lambda df: df.xs(df.name).to_dict()).to_dict()
# 849 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dict2 = {}
for (idx1, idx2), *vals in df.itertuples():
    for col, val in zip(df.columns, vals):
        dict2.setdefault(idx1, {}).setdefault(col, {})[idx2] = val
# 50.3 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see from the benchmark above, with 1000 groups the straightforward loop is about 16 times faster than the groupby.apply solution. This gap widens as the number of groups increases.

On the other hand, if there are only a few groups (<10), both approaches are very fast and performance probably isn't a concern.

More benchmarks on this point can be found at Convert Pandas Dataframe to nested JSON.

Upvotes: 2

piRSquared

Reputation: 294218

To convert the whole dataframe to a dictionary, try:

df.groupby(level=0).apply(lambda df: df.xs(df.name).to_dict()).to_dict()

{'bar': {'01.01': {'total1': 40, 'total2': 36},
  '02.01': {'total1': 52, 'total2': 85},
  '03.01': {'total1': 18, 'total2': 5},
  '04.01': {'total1': 11, 'total2': 92}},
 'baz': {'01.01': {'total1': 23, 'total2': 50},
  '02.01': {'total1': 39, 'total2': 49},
  '03.01': {'total1': 45, 'total2': 51},
  '04.01': {'total1': 70, 'total2': 65}},
 'foo': {'01.01': {'total1': 23, 'total2': 64},
  '02.01': {'total1': 97, 'total2': 56},
  '03.01': {'total1': 17, 'total2': 94},
  '04.01': {'total1': 97, 'total2': 45}},
 'qux': {'01.01': {'total1': 13, 'total2': 80},
  '02.01': {'total1': 73, 'total2': 8},
  '03.01': {'total1': 38, 'total2': 61},
  '04.01': {'total1': 4, 'total2': 50}}}

To convert one particular column, select it before converting to a dictionary, i.e.

df.groupby(level=0).apply(lambda df: df.xs(df.name)[colname].to_dict()).to_dict()
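For example, with the frame from the question and colname = '01.01', this should give something like:

df.groupby(level=0).apply(lambda df: df.xs(df.name)['01.01'].to_dict()).to_dict()

{'bar': {'total1': 40, 'total2': 36},
 'baz': {'total1': 23, 'total2': 50},
 'foo': {'total1': 23, 'total2': 64},
 'qux': {'total1': 13, 'total2': 80}}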

Upvotes: 38
