Reputation: 967
I have a dataframe that uses MultiIndex for both index and columns. For example:
df = pd.DataFrame(index=pd.MultiIndex.from_product([[1,2], [1,2,3], [4,5]], names=['i','j', 'k']), columns=pd.MultiIndex.from_product([[1,2], [1,2]], names=['x', 'y']))
for c in df.columns:
df[c] = np.random.randint(100, size=(12,1))
x 1 2
y 1 2 1 2
i j k
1 1 4 10 13 0 76
5 92 37 52 40
2 4 88 77 50 22
5 75 31 19 1
3 4 61 23 5 47
5 43 68 10 21
2 1 4 23 15 17 5
5 47 68 6 94
2 4 0 12 24 54
5 83 27 46 19
3 4 7 22 5 15
5 7 10 89 79
I want to group the values by a name in the index and by a name in the columns.
For each such group, we will have a 2D array of numbers (rather than a Series). I want to aggregate std()
of all entries in that 2D array.
For example, let's say I groupby ['i', 'x']
, one group would be with values of i=1
and x=1
. I want to compute std
for each of these 2D arrays and produce a DataFrame
with i
values as index and x
values as columns.
What is the best way to achieve this?
If I do stack()
to get x
as an index, I will still be computing several std()
instead of one as there will still be multiple columns.
Upvotes: 1
Views: 1006
Reputation: 10545
You can use nested list comprehensions. For your example, with the given kind of DataFrame (not the same, as the values are random; you may want to fix a seed value so that results are comparable) and i
and x
as the indices of interest, it would work like this:
# get values of the top level row index
rows = set(df.index.get_level_values(0))
# get values of the top level column index
columns = set(df.columns.get_level_values(0))
# for every sub-dataframe (every combination of top-level indices)
# compute sampling standard deviation (1 degree of freedom) across all values
df_groupSD = pd.DataFrame([[df.loc[(row, )][(col, )].values.std(ddof=1)
for col in columns] for row in rows],
index = rows, columns = columns)
# show result
display(df_groupSD)
Output:
1 2
1 31.455115 25.433812
2 29.421699 33.748962
There may be better ways, of course.
Upvotes: 1
Reputation: 30971
Try the following code:
df.groupby(level=0).apply(lambda grp: grp.stack().std())
Upvotes: 1
Reputation: 29635
You can use stack
to put the 'y' level of column as index and then groupby
only i to get:
print (df.stack(level='y').groupby(['i']).std())
x 1 2
i
1 32.966811 23.933462
2 28.668825 28.541835
Upvotes: 1