Kshitij Agrawal
Kshitij Agrawal

Reputation: 233

sum of dynamic columns based on certain condition

I am trying to get sum of dynamic columns based on certain condition.

cols = ['ID','A','B','C','D','E','F','G']

dataframe df has all the columns listed above.

If ID = 2, I need sum of first two columns A, B IF ID = 3, I need sum of first three columns A,B,C

df.loc['SUM'] = df.loc[df['ID'] > 0,cols[0:df['ID']]].sum(axis=1)

above line of code is giving an error that :

TypeError: slice indices must be integers or None or have an __index__ method

Note : ID can be any number but it would be always less than or equal to total number of columns

There are no nan and missing values in dataframe.

Upvotes: 5

Views: 267

Answers (2)

B. M.
B. M.

Reputation: 18628

Another way :

with df :

    0  1  2  3  4  5  6  7   
ID                              
4   6  1  4  4  8  4  6  3  
7   5  8  7  9  9  2  7  8  
4   8  9  2  6  9  5  4  1 
3   4  6  1  3  4  9  2  4  
5   4  4  8  1  2  1  5  8  

sums are computed by df.cumsum(axis=1).values[range(len(df)),df.index-1]

it's 5x faster than df.where(np.arange(df.shape[1]) <df.index.values[:,None]).sum(axis=1).

Upvotes: 1

cs95
cs95

Reputation: 402413

You can vectorise this thanks to the power of NumPy:

df.set_index('ID', inplace=True)
df.wherenp.arange(df.shape[1]) < df.index.values[:,None]).sum(axis=1)    

Minimal Reproducible Example

import pandas as pd

pd.__version__
# '0.24.2'

np.random.seed(0) 
df = pd.DataFrame(np.random.randint(1, 10, (5, 8))) 
df.insert(0, 'ID', np.random.randint(1, 8, 5))  
df                                           

   ID  0  1  2  3  4  5  6  7
0   4  6  1  4  4  8  4  6  3
1   7  5  8  7  9  9  2  7  8
2   4  8  9  2  6  9  5  4  1
3   3  4  6  1  3  4  9  2  4
4   5  4  4  8  1  2  1  5  8

df.set_index('ID', inplace=True)
df.assign(
    SUM=df.where(np.arange(df.shape[1]) < df.index.values[:,None]).sum(axis=1))

    0  1  2  3  4  5  6  7   SUM
ID                              
4   6  1  4  4  8  4  6  3  15.0
7   5  8  7  9  9  2  7  8  47.0
4   8  9  2  6  9  5  4  1  25.0
3   4  6  1  3  4  9  2  4  11.0
5   4  4  8  1  2  1  5  8  19.0

The where function will hide values that we don't want to sum over:

df.where(np.arange(df.shape[1]) < df.index[:,None])   

    0  1  2    3    4    5    6   7
ID                                 
4   6  1  4  4.0  NaN  NaN  NaN NaN
7   5  8  7  9.0  9.0  2.0  7.0 NaN
4   8  9  2  6.0  NaN  NaN  NaN NaN
3   4  6  1  NaN  NaN  NaN  NaN NaN
5   4  4  8  1.0  2.0  NaN  NaN NaN

Thanks to @Quang Hoang for pointing out a potential bug in the solution.

Upvotes: 3

Related Questions