Summing rows from a MultiIndex pandas df based on index label

Question

If I have a pandas.DataFrame which looks like this:

         Probability
0 1 2 3          
C H C C  0.058619
H H C H  0.073274
C H C H  0.195398
C H H H  0.113982
C H H C  0.009770
C C C C  0.087929
H C H H  0.005234
H H H C  0.003664
H H C C  0.021982
C C H C  0.004187
H C H C  0.000449
C C H H  0.048849
H C C C  0.009421
H H H H  0.042743
C C C H  0.293096
H C C H  0.031403

The index is a tuple of len(4) and its order corresponds to all sequences of length four and a 2 letter alphabet ['H','C']. What is the best way to sum the rows with a H in position [i for i in df.index] ?

df.ix['H'].sum() is functional but I can't see how to generalize to the 'any case' method. For example, I need to be able to perform the same calculation regardless of how long the sequence is or with more than a 2 letter alphabet. Moreover, the df.ix['H'] is difficult because it doesn't seem to accept wild cards, i.e. df.ix['*','H'] for all sequences with H in index position 1. Does anybody have any suggestions? thanks

pansen · Accepted Answer

Setup

To create a dummy data frame which corresponds to your provided example, I used the following:

import pandas as pd
import numpy as np
import random

# define sequence and target
sequence = ["H", "C"]
target = "H"

# define shapes
size_col = 4
size_row = 100

# create dummy data and dummy columns
array_indices = np.random.choice(sequence, size=(size_row, size_col))
array_value = np.random.random(size=(size_row, 1))
array = np.concatenate([array_indices, array_value], axis=1)

col_indices = ["Idx {}".format(x) for x in range(size_col)]
col_values = ["Probability"]
columns = col_indices + col_values

# create pandas data frame
df = pd.DataFrame(array, columns=columns)
df[col_values] = df[col_values].astype(float)

The resulting pandas.DataFrame looks like this:

>>> print(df.head())

Idx 0   Idx 1   Idx 2   Idx 3   Probability
  C       C       C       H     0.892125
  C       H       C       H     0.633699
  C       C       C       C     0.228546
  H       C       H       C     0.766639
  C       H       C       C     0.379930

The only difference to your data frame is the reset index (you get the same when using df.reset_index()).

Solution

Now, to get the sums of the rows with a target value for all indices, you may use the following:

bool_indices = df[col_indices] == target
result = bool_indices.apply(lambda x: df.loc[x, col_values].sum())

First, you create a new data frame with boolean values which correspond to each index column containing the target value for each row.

Second, you use these boolean series as index columns to define a subset of your actual value column and finally apply an arbitrary method like sum() on it.

The result is the following:

>>> print(result)

                   Idx 0       Idx 1      Idx 2       Idx 3
Probability     23.246007   23.072544   24.775996   24.683079

This solution is flexible in regard to your input sequence, the target and the shape of your data.

In addition, if you want to use slicing with wildcards, you can use the pandas.IndexSlice on your original data frame example like:

idx = pd.IndexSlice

# to get all rows which have the "H" at second index
df.loc[idx[:, "H"], :]

# to get all rows which have the "H" at third index
df.loc[idx[:, :, "H"], :]

Summing rows from a MultiIndex pandas df based on index label

Answers (2)

Setup

Solution

Related Questions