Suresh
Suresh

Reputation: 967

Pandas: Groupby names in index and columns

I have a dataframe that uses MultiIndex for both index and columns. For example:

df = pd.DataFrame(index=pd.MultiIndex.from_product([[1,2], [1,2,3], [4,5]], names=['i','j', 'k']), columns=pd.MultiIndex.from_product([[1,2], [1,2]], names=['x', 'y']))
for c in df.columns:
    df[c] = np.random.randint(100, size=(12,1))

x       1       2
y       1   2   1   2
i j k
1 1 4  10  13   0  76
    5  92  37  52  40
  2 4  88  77  50  22
    5  75  31  19   1
  3 4  61  23   5  47
    5  43  68  10  21
2 1 4  23  15  17   5
    5  47  68   6  94
  2 4   0  12  24  54
    5  83  27  46  19
  3 4   7  22   5  15
    5   7  10  89  79

I want to group the values by a name in the index and by a name in the columns. For each such group, we will have a 2D array of numbers (rather than a Series). I want to aggregate std() of all entries in that 2D array.

For example, let's say I groupby ['i', 'x'], one group would be with values of i=1 and x=1. I want to compute std for each of these 2D arrays and produce a DataFrame with i values as index and x values as columns.

What is the best way to achieve this?

If I do stack() to get x as an index, I will still be computing several std() instead of one as there will still be multiple columns.

Upvotes: 1

Views: 1006

Answers (3)

Arne
Arne

Reputation: 10545

You can use nested list comprehensions. For your example, with the given kind of DataFrame (not the same, as the values are random; you may want to fix a seed value so that results are comparable) and i and x as the indices of interest, it would work like this:

# get values of the top level row index
rows = set(df.index.get_level_values(0))
# get values of the top level column index
columns = set(df.columns.get_level_values(0))

# for every sub-dataframe (every combination of top-level indices) 
# compute sampling standard deviation (1 degree of freedom) across all values
df_groupSD = pd.DataFrame([[df.loc[(row, )][(col, )].values.std(ddof=1)  
                            for col in columns] for row in rows], 
                          index = rows, columns = columns)

# show result
display(df_groupSD)  

Output:

    1           2
1   31.455115   25.433812
2   29.421699   33.748962

There may be better ways, of course.

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30971

Try the following code:

df.groupby(level=0).apply(lambda grp: grp.stack().std())

Upvotes: 1

Ben.T
Ben.T

Reputation: 29635

You can use stack to put the 'y' level of column as index and then groupby only i to get:

print (df.stack(level='y').groupby(['i']).std())
x          1          2
i                      
1  32.966811  23.933462
2  28.668825  28.541835

Upvotes: 1

Related Questions