Convert dict constructor to Pandas MultiIndex dataframe

Question

I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me, and this time I can't get my head around it either.

I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict variant. Note that the original data has a lot more labels and more rows as well.

The idea is that the original data contains rows of a task with index Task_n that has been performed by a participant with index Participant_n. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:

Participant_n | Task_n | val | dur
----------------------------------
            1 |      1 |  12 |   2
            1 |      1 |   3 |   4
            1 |      1 |   4 |  12
            1 |      2 |  11 |  11
            1 |      2 |  34 |   4

The above example contains one participant and two tasks, with three and two segments (rows) respectively.

In Python, with a dict structure this looks like this:

import pandas as pd

cols = ['Participant_n', 'Task_n', 'val', 'dur']

data = [[1,1,25,83],
        [1,1,4,68],
        [1,1,9,987],
        [1,2,98,98],
        [1,2,84,4],
        [2,1,9,21],
        [2,2,15,6],
        [2,2,185,6],
        [2,2,18,4],
        [2,3,8,12],
        [3,1,7,78],
        [3,1,12,88],
        [3,2,12,48]]

d = pd.DataFrame(data, columns=cols)

part_d = {}
for row in d.itertuples():
    participant_n = row.Participant_n
    participant = "participant" + str(participant_n)
    task = "task" + str(row.Task_n)

    if participant in part_d:
        part_d[participant]['all_sum']['val'] += int(row.val)
        part_d[participant]['all_sum']['dur'] += int(row.dur)
    else:
        part_d[participant] = {
            'prof': 0 if participant_n < 20 else 1,
            'all_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }

    if task in part_d[participant]:
        # Get already existing keys
        k = list(part_d[participant][task].keys())

        k_int = []
        # Only get the ints (i.e. not all_sum etc.)
        for n in k:
            # Get digit from e.g. seg1
            n = n[3:]
            try:
                k_int.append(int(n))
            except ValueError:
                pass

        # Increment max by 1
        i = max(k_int) + 1
        part_d[participant][task][f"seg{i}"] = {
            'val': int(row.val),
            'dur': int(row.dur),
        }
        part_d[participant][task]['task_sum']['val'] += int(row.val)
        part_d[participant][task]['task_sum']['dur'] += int(row.dur)
    else:
        part_d[participant][task] = {
            'seg1': {
                'val': int(row.val),
                'dur': int(row.dur),
            },
            'task_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }

print(part_d)

In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):

{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}

Instead of a dictionary, I would like this to end up in a pd.DataFrame with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1 or seg1 I just used the indices.)

Participant   Prof all_sum      Task    Task_sum     Seg   val   dur
                   val    dur           val    dur
====================================================================
participant1  0    220   1240      1     38   1138     1    25    83
                                                       2     4    68
                                                       3     9   987
                                   2    182    102     1    98    98
                                                       2    84     4
--------------------------------------------------------------------
participant2  0    235     49      1      9     21     1     9    21
                                   2    218     16     1    15     6
                                                       2   185     6
                                                       3    18     4
                                   3      8     12     1     8    12
--------------------------------------------------------------------
participant3  0     31    214      1     19    166     1     7    78
                                                       2    12    88
                                   2     12     48     1    12    48

Is this a structure that is possible in Pandas? If not, what are reasonable alternatives?

Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:

Participant  Prof  all_sum_val  all_sum_dur  Task  Task_sum_val  Task_sum_dur  Seg

The main issue I am having is that I do not understand how I can build a multi-index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict approach, and I guess I'd then have to append/concat to an initially empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take the multi-index into account. So how?

For the people who have read this far and want to try their hand at this, I think that my original code can be reused for the most part (loop and variable assignment), but instead of a dict it would have to use accessors to the DataFrame. That is an important aspect: data should be easily readable with getters/setters, just as in a regular DataFrame. E.g. it should be easy to get the duration value for participant 2, task 2, segment 2, and so on. Getting a subset of the data (e.g. where prof == 0) should also be straightforward.

Haleemur Ali · Accepted Answer

I faced a similar issue with data presentation and came up with the following helper functions for groupby with subtotals.

With this process it's possible to generate subtotals for an arbitrary number of group-by columns. However, the output data has a different format: instead of the subtotals being put in their own columns, each subtotal adds an extra row to the data frame.

For interactive data exploration and analysis, I find this very helpful, as it's possible to get the subtotals with just a couple of lines of code:

import numpy as np
import pandas as pd

def get_subtotals(frame, columns, aggvalues, subtotal_level):
    # Level 0: the fully detailed rows, grouped by every column.
    if subtotal_level == 0:
        return frame.groupby(columns, as_index=False).agg(aggvalues)

    # Highest level: grand total over the whole frame; every group column becomes NaN.
    elif subtotal_level == len(columns):
        return pd.DataFrame(frame.agg(aggvalues)).transpose().assign(
            **{c: np.nan for c in columns}
        )

    # Intermediate level: group by the first `subtotal_level` columns, NaN for the rest.
    return frame.groupby(
        columns[:subtotal_level],
        as_index=False
    ).agg(aggvalues).assign(
        **{c: np.nan for c in columns[subtotal_level:]}
    )

def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'):
    gt = 1 if grand_totals else 0
    out = pd.concat(
        [get_subtotals(frame, columns, aggvalues, i)
         for i in range(len(columns)+gt)]
    ).sort_values(columns, na_position=totals_position)
    # A NaN in a group column marks a subtotal row; label it 'total'.
    out[columns] = out[columns].fillna('total')
    return out.set_index(columns)

Reusing the dataframe creation code from Gabriel A's answer:

cols = ['Participant_n', 'Task_n', 'val', 'dur']

data = [[1,1,25,83],
        [1,1,4,68],
        [1,1,9,987],
        [1,2,98,98],
        [1,2,84,4],
        [2,1,9,21],
        [2,2,15,6],
        [2,2,185,6],
        [2,2,18,4],
        [2,3,8,12],
        [3,1,7,78],
        [3,1,12,88],
        [3,2,12,48]]

df = pd.DataFrame(data, columns=cols)

It is first necessary to add the seg column, which numbers the segments within each participant/task pair:

df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
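As a small illustration of what cumcount does here, the sketch below applies the same line to the five-row table from the question (one participant, two tasks). The name demo is only illustrative.

```python
import pandas as pd

# Five rows from the question's first table: one participant, two tasks.
demo = pd.DataFrame({
    "Participant_n": [1, 1, 1, 1, 1],
    "Task_n": [1, 1, 1, 2, 2],
    "val": [12, 3, 4, 11, 34],
    "dur": [2, 4, 12, 11, 4],
})

# cumcount() numbers the rows 0, 1, 2, ... within each group, in their
# original order; adding 1 makes the segment numbers start at 1.
demo["seg"] = demo.groupby(["Participant_n", "Task_n"]).cumcount() + 1

print(demo["seg"].tolist())  # [1, 2, 3, 1, 2]
```

Because the numbering restarts for every (Participant_n, Task_n) pair, the number of tasks or segments does not need to be known in advance.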

Then we can use groupby_with_subtotals like this. Additionally, note that you can place the subtotals at the top and also include grand_totals by passing in grand_totals=True, totals_position='first'

groupby_columns = ['Participant_n', 'Task_n', 'seg']
groupby_aggs = {'val': 'sum', 'dur': 'sum'}
aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs)
aggdf
# outputs

                             dur  val
Participant_n Task_n seg
1             1.0    1.0      83   25
                     2.0      68    4
                     3.0     987    9
                     total  1138   38
              2.0    1.0      98   98
                     2.0       4   84
                     total   102  182
              total  total  1240  220
2             1.0    1.0      21    9
                     total    21    9
              2.0    1.0       6   15
                     2.0       6  185
                     3.0       4   18
                     total    16  218
              3.0    1.0      12    8
                     total    12    8
              total  total    49  235
3             1.0    1.0      78    7
                     2.0      88   12
                     total   166   19
              2.0    1.0      48   12
                     total    48   12
              total  total   214   31

Here, the subtotal rows are marked with total, and the leftmost total indicates the subtotal level.
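The grand_totals=True, totals_position='first' variant mentioned earlier could look roughly like the sketch below. To keep it self-contained, it uses a condensed restatement of the helper under the hypothetical name subtotals_sketch; it is meant as an illustration of the same idea, not a replacement for the functions above.

```python
import numpy as np
import pandas as pd

def subtotals_sketch(frame, columns, aggvalues, grand_totals=False, totals_position="last"):
    # Hypothetical, condensed variant of groupby_with_subtotals.
    parts = [frame.groupby(columns, as_index=False).agg(aggvalues)]  # detail rows
    for n_keys in range(1, len(columns)):
        # Subtotal over the first n_keys columns; the remaining key columns become NaN.
        sub = frame.groupby(columns[:n_keys], as_index=False).agg(aggvalues)
        parts.append(sub.assign(**{c: np.nan for c in columns[n_keys:]}))
    if grand_totals:
        # One row aggregating the whole frame; every key column is NaN.
        total = frame.agg(aggvalues).to_frame().T
        parts.append(total.assign(**{c: np.nan for c in columns}))
    out = pd.concat(parts).sort_values(columns, na_position=totals_position)
    out[columns] = out[columns].fillna("total")
    return out.set_index(columns)

cols = ["Participant_n", "Task_n", "val", "dur"]
data = [[1, 1, 25, 83], [1, 1, 4, 68], [1, 1, 9, 987], [1, 2, 98, 98], [1, 2, 84, 4],
        [2, 1, 9, 21], [2, 2, 15, 6], [2, 2, 185, 6], [2, 2, 18, 4], [2, 3, 8, 12],
        [3, 1, 7, 78], [3, 1, 12, 88], [3, 2, 12, 48]]
df = pd.DataFrame(data, columns=cols)
df["seg"] = df.groupby(["Participant_n", "Task_n"]).cumcount() + 1

aggdf = subtotals_sketch(
    df, ["Participant_n", "Task_n", "seg"], {"val": "sum", "dur": "sum"},
    grand_totals=True, totals_position="first",
)

# With totals_position="first", the grand total is the first row and each
# subtotal row precedes the rows it summarises.
grand = aggdf.loc[("total", "total", "total"), :]
print(grand)
```

With grand totals enabled, the Participant_n level also contains the label total, so all three index levels hold mixed values.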

Once the aggregate data frame is created, it's possible to access the subtotals using loc. For example:

aggdf.loc[1,'total','total']
# outputs:
dur    1240
val     220
Name: (1, total, total), dtype: int64
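If the subtotals are wanted as columns instead (the flattened header proposed in the question), one possible approach is groupby(...).transform('sum'), which returns sums aligned with the original rows. The sketch below is an assumption about how that could look, not part of the helper functions above; wide, task_sum and all_sum are illustrative names, and the prof rule is copied from the question's dict code.

```python
import pandas as pd

cols = ["Participant_n", "Task_n", "val", "dur"]
data = [[1, 1, 25, 83], [1, 1, 4, 68], [1, 1, 9, 987], [1, 2, 98, 98], [1, 2, 84, 4],
        [2, 1, 9, 21], [2, 2, 15, 6], [2, 2, 185, 6], [2, 2, 18, 4], [2, 3, 8, 12],
        [3, 1, 7, 78], [3, 1, 12, 88], [3, 2, 12, 48]]
df = pd.DataFrame(data, columns=cols)
df["seg"] = df.groupby(["Participant_n", "Task_n"]).cumcount() + 1

# transform("sum") keeps the original row index, so every segment row
# carries the sum of the group it belongs to.
task_sum = df.groupby(["Participant_n", "Task_n"])[["val", "dur"]].transform("sum")
all_sum = df.groupby("Participant_n")[["val", "dur"]].transform("sum")

wide = (
    df.assign(
        prof=(df["Participant_n"] >= 20).astype(int),  # same rule as in the question
        all_sum_val=all_sum["val"],
        all_sum_dur=all_sum["dur"],
        task_sum_val=task_sum["val"],
        task_sum_dur=task_sum["dur"],
    )
    .set_index(["Participant_n", "Task_n", "seg"])
    .sort_index()
)

# Duration for participant 2, task 2, segment 2.
dur_222 = wide.loc[(2, 2, 2), "dur"]
print(dur_222)  # 6

# Subset of rows where prof == 0, and all rows of participant 1.
non_prof = wide[wide["prof"] == 0]
p1 = wide.loc[1]
```

The trade-off is that the sums are repeated on every segment row, but the index stays purely numeric and ordinary boolean filtering and loc lookups work without special-casing total rows.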
