Reputation: 28554
I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me and also this time I can't get my head around it.
I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict
variant. Note that the original data has a lot more labels and more rows as well.
The idea is that the original data contains rows of a task with index Task_n
that has been performed by a participant with index Participant_n
. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:
Participant_n | Task_n | val | dur
----------------------------------
1 | 1 | 12 | 2
1 | 1 | 3 | 4
1 | 1 | 4 | 12
1 | 2 | 11 | 11
1 | 2 | 34 | 4
The above example contains one participants, two tasks, with respectively three and two segments (rows).
In Python, with a dict
structure this looks like this:
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
d = pd.DataFrame(data, columns=cols)
part_d = {}
for row in d.itertuples():
participant_n = row.Participant_n
participant = "participant" + str(participant_n)
task = "task" + str(row.Task_n)
if participant in part_d:
part_d[participant]['all_sum']['val'] += int(row.val)
part_d[participant]['all_sum']['dur'] += int(row.dur)
else:
part_d[participant] = {
'prof': 0 if participant_n < 20 else 1,
'all_sum': {
'val': int(row.val),
'dur': int(row.dur),
}
}
if task in part_d[participant]:
# Get already existing keys
k = list(part_d[participant][task].keys())
k_int = []
# Only get the ints (i.e. not all_sum etc.)
for n in k:
# Get digit from e.g. seg1
n = n[3:]
try:
k_int.append(int(n))
except ValueError:
pass
# Increment max by 1
i = max(k_int) + 1
part_d[participant][task][f"seg{i}"] = {
'val': int(row.val),
'dur': int(row.dur),
}
part_d[participant][task]['task_sum']['val'] += int(row.val)
part_d[participant][task]['task_sum']['dur'] += int(row.dur)
else:
part_d[participant][task] = {
'seg1': {
'val': int(row.val),
'dur': int(row.dur),
},
'task_sum': {
'val': int(row.val),
'dur': int(row.dur),
}
}
print(part_d)
In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof
which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):
{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}
Instead of a dictionary, I would like this to end up in a pd.DataFrame
with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1
or seg1
I just used the indices.)
Participant Prof all_sum Task Task_sum Seg val dur
val dur val dur
====================================================================
participant1 0 220 1240 1 38 1138 1 25 83
2 4 68
3 9 987
2 182 102 1 98 98
2 84 4
--------------------------------------------------------------------
participant2 0 235 49 1 9 21 1 9 21
2 218 16 1 15 6
2 185 6
3 18 4
3 8 12 1 8 12
--------------------------------------------------------------------
participant3 0 31 214 1 19 166 1 7 78
2 12 88
2 12 48 1 12 48
Is this a structure that is possible in Pandas? If not, which reasonable alternatives are?
Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:
Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg
The main issue I am having is that I do not understand how I can build a multi index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict
approach and I guess I'd then have to append/concat to an initial empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take multi index in account. So how?
For the people who have read this far and want to try their hand at this, I think that my original code can be re-used for the most part (loop and variable assignment), but instead of a dict it have to be accessors to the DataFrame. That an import aspect: data should be easily readable with getters/setters, just as a regular DataFrame is. E.g. it should be easy to get the duration value for participant two, task 2, segment 2, and so on. But also, getting a subset of the data (e.g. where prof === 0
) should be without problems.
Upvotes: 8
Views: 521
Reputation: 28313
I faced a similar issue with data presentation and came up with the following helper functions for groupby with subtotals.
With this process it's possible to generate subtotals for an arbitrary number of group by columns, however the output data has a different format. Instead of the subtotals being put in their own columns, each subtotal adds an extra row to the data frame.
For interactive data exploration & analysis, I find this very helpful as its possible to get the subtotals with just a couple of lines of code
def get_subtotals(frame, columns, aggvalues, subtotal_level):
if subtotal_level == 0:
return frame.groupby(columns, as_index=False).agg(aggvalues)
elif subtotal_level == len(columns):
return pd.DataFrame(frame.agg(aggvalues)).transpose().assign(
**{c: np.nan for i, c in enumerate(columns)}
)
return frame.groupby(
columns[:subtotal_level],
as_index=False
).agg(aggvalues).assign(
**{c: np.nan for i, c in enumerate(columns[subtotal_level:])}
)
def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'):
gt = 1 if grand_totals else 0
out = pd.concat(
[get_subtotals(df, columns, aggvalues, i)
for i in range(len(columns)+gt)]
).sort_values(columns, na_position=totals_position)
out[columns] = out[columns].fillna('total')
return out.set_index(columns)
resuing the dataframe creation code from Gabriel A's answer
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
It is first necessary to add the seg
column
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
Then we can use groupby_with_subtotals
like this. Additionally, note that you can place the subtotals at the top and also include grand_totals by passing in grand_totals=True, totals_position='first'
groupby_columns = ['Participant_n', 'Task_n', 'seg']
groupby_aggs = {'val': 'sum', 'dur': 'sum'}
aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs)
aggdf
# outputs
dur val
Participant_n Task_n seg
1 1.0 1.0 83 25
2.0 68 4
3.0 987 9
total 1138 38
2.0 1.0 98 98
2.0 4 84
total 102 182
total total 1240 220
2 1.0 1.0 21 9
total 21 9
2.0 1.0 6 15
2.0 6 185
3.0 4 18
total 16 218
3.0 1.0 12 8
total 12 8
total total 49 235
3 1.0 1.0 78 7
2.0 88 12
total 166 19
2.0 1.0 48 12
total 48 12
total total 214 31
Here, the subtotals rows are marked with total
, and the left most total
indicates the subtotal level.
Once the aggregate data frame is created, its possible to access the subtotals using loc
. example:
aggdf.loc[1,'total','total']
# outputs:
dur 1240
val 220
Name: (1, total, total), dtype: int64
Upvotes: 2
Reputation: 1827
My only suggestion is to get rid of all your dictionary stuff. All of that code can be re-written in Pandas without much effort. This will likely speed up the transformation process as well but will take some time. To help you in the process I have rewritten the section you provided. The rest is up to you.
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum")
df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum")
df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1
df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum")
df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum")
df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]]
df = df.sort_index()
df
Output
seg val dur
Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur
1 220 1240 1 38 1138 1 25 83
1138 2 4 68
1138 3 9 987
2 182 102 1 98 98
102 2 84 4
2 235 49 1 9 21 1 9 21
2 218 16 1 15 6
16 2 185 6
16 3 18 4
3 8 12 1 8 12
3 31 214 1 19 166 1 7 78
166 2 12 88
2 12 48 1 12 48
Try to run this code and let me know what you think. Comment with any questions.
Upvotes: 6