Reputation: 7681
If I have a pandas.DataFrame which looks like this:
Probability
0 1 2 3
C H C C 0.058619
H H C H 0.073274
C H C H 0.195398
C H H H 0.113982
C H H C 0.009770
C C C C 0.087929
H C H H 0.005234
H H H C 0.003664
H H C C 0.021982
C C H C 0.004187
H C H C 0.000449
C C H H 0.048849
H C C C 0.009421
H H H H 0.042743
C C C H 0.293096
H C C H 0.031403
The index is a tuple of length 4, and its entries correspond to all sequences of length four over the two-letter alphabet ['H','C']. What is the best way to sum the rows that have an H in a given index position i, for each position i?
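For reference, a frame with this structure can be reproduced roughly like this (the probabilities below are random placeholders, not my actual values):
import itertools
import numpy as np
import pandas as pd

# all 16 sequences of length four over the alphabet ['H', 'C']
tuples = list(itertools.product(['H', 'C'], repeat=4))
probs = np.random.random(len(tuples))
probs /= probs.sum()          # normalise so the placeholder probabilities sum to 1
df = pd.DataFrame({'Probability': probs},
                  index=pd.MultiIndex.from_tuples(tuples))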
df.ix['H'].sum() works, but I can't see how to generalize it to the arbitrary case. For example, I need to be able to perform the same calculation regardless of how long the sequence is, or with an alphabet of more than two letters. Moreover, df.ix['H'] is awkward because it doesn't seem to accept wildcards, i.e. df.ix['*','H'] for all sequences with an H in index position 1. Does anybody have any suggestions? Thanks
Upvotes: 3
Views: 967
Reputation: 7681
An alternative to the solution pansen suggested is to use pandas.groupby over the index levels:
import pandas

levels = [0, 1, 2, 3]
df_l = []
for i in levels:
    # group the Probability column by the letter found at index level i
    for letter, group in df.groupby(level=i):
        MI = pandas.MultiIndex.from_tuples([(i, letter)])
        val = float(group["Probability"].sum())
        df_l.append(pandas.DataFrame([val], index=MI))
result = pandas.concat(df_l)
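The concatenated frame then has one row per (level, letter) pair, so, for example, the total probability of all sequences with an H at position 1 can be read off directly (result is just the variable name used above):
# row for index level 1 and letter "H"
print(result.loc[(1, "H")])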
Upvotes: 0
Reputation: 6663
To create a dummy data frame which corresponds to your provided example, I used the following:
import pandas as pd
import numpy as np
# define sequence and target
sequence = ["H", "C"]
target = "H"
# define shapes
size_col = 4
size_row = 100
# create dummy data and dummy columns
array_indices = np.random.choice(sequence, size=(size_row, size_col))
array_value = np.random.random(size=(size_row, 1))
array = np.concatenate([array_indices, array_value], axis=1)
col_indices = ["Idx {}".format(x) for x in range(size_col)]
col_values = ["Probability"]
columns = col_indices + col_values
# create pandas data frame
df = pd.DataFrame(array, columns=columns)
df[col_values] = df[col_values].astype(float)
The resulting pandas.DataFrame looks like this:
>>> print(df.head())
Idx 0 Idx 1 Idx 2 Idx 3 Probability
C C C H 0.892125
C H C H 0.633699
C C C C 0.228546
H C H C 0.766639
C H C C 0.379930
The only difference from your data frame is that the index is reset (you would get the same result by calling df.reset_index() on yours).
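If you prefer to keep a MultiIndex like in your original frame, a small sketch using the column names defined above would be (df_multi is just a throwaway name):
# optional: move the index columns back into a MultiIndex,
# giving a frame shaped like the one in the question
df_multi = df.set_index(col_indices)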
Now, to get the sums of the rows with a target value for all indices, you may use the following:
bool_indices = df[col_indices] == target
result = bool_indices.apply(lambda x: df.loc[x, col_values].sum())
First, you create a new data frame of boolean values that indicate, for each row, whether the corresponding index column contains the target value.
Second, you use each of these boolean series as a row mask to select a subset of your actual value column, and finally apply an arbitrary aggregation like sum() to it.
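To see the intermediate step, you can peek at the mask itself (the exact True/False pattern depends on the random draw above):
# one boolean per row and index column: True where the column equals the target
print(bool_indices.head())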
The result is the following:
>>> print(result)
Idx 0 Idx 1 Idx 2 Idx 3
Probability 23.246007 23.072544 24.775996 24.683079
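As a quick sanity check (a sketch, not part of the solution itself), the entry for a single column can be reproduced with a plain boolean filter:
# should print the same value as result.loc["Probability", "Idx 0"]
print(float(df.loc[df["Idx 0"] == target, "Probability"].sum()))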
This solution is flexible in regard to your input sequence, the target and the shape of your data.
In addition, if you want slicing with wildcards, you can use pandas.IndexSlice on your original data frame, for example:
idx = pd.IndexSlice
# to get all rows which have the "H" at second index
df.loc[idx[:, "H"], :]
# to get all rows which have the "H" at third index
df.loc[idx[:, :, "H"], :]
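Note that slicing a MultiIndex with slice objects may require the index to be lexsorted first (depending on your pandas version you get an UnsortedIndexError otherwise), so a sketch that ties this back to your original question would be:
# sort the MultiIndex so slice-based lookups are allowed
df = df.sort_index()
# total probability of all sequences with "H" at index position 1
print(df.loc[idx[:, "H"], :].sum())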
Upvotes: 1