Rodrigo

Reputation: 67

How to sum variable ranges in a pandas column to another column

I'm relatively new to pandas and I don't know the best approach to my problem. I have a df with an index, the data in a column called 'Data', and an empty column called 'Sum'.

I need help creating a function that puts the sum of each variable-length group of rows of the 'Data' column into the 'Sum' column. The grouping criterion is that a group must not contain empty rows.

Here is an example:

index  Data Sum
0       1   
1       1   2
2       
3       
4       1   
5       1   
6       1   3
7       
8       1   
9       1   2
10      
11      1   
12      1   
13      1   
14      1   
15      1   5   
16  
17      1   1
18  
19      1   1
20  

As you can see, the length of each group of data in 'Data' is variable: it could be a single row or any number of rows. The sum must always go at the end of the group. For example, the sum of rows 4, 5 and 6 of the 'Data' column should go in row 6 of the 'Sum' column.

Any insight will be appreciated.

UPDATE

The problem was solved by implementing Method 3 suggested by ansev. However, due to a change in the main program, the sum of each block now needs to be at the beginning of each one (in case the block has more than one row). So I used the df = df.iloc[::-1] instruction twice, to reverse the column and then bring it back to normal. Thank you very much!

df = df.iloc[::-1]
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['Sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
df = df.iloc[::-1]

print(df)

Data  Sum
0    1.0  2.0
1    1.0  NaN
2    NaN  NaN
3    NaN  NaN
4    1.0  3.0
5    1.0  NaN
6    1.0  NaN
7    NaN  NaN
8    1.0  2.0
9    1.0  NaN
10   NaN  NaN
11   1.0  5.0
12   1.0  NaN
13   1.0  NaN
14   1.0  NaN
15   1.0  NaN
16   NaN  NaN
17   1.0  1.0
18   NaN  NaN
19   1.0  1.0
20   NaN  NaN

Upvotes: 3

Views: 1242

Answers (3)

ansev

Reputation: 30940

We can use GroupBy.cumsum:

# if you first need to replace blank strings with NaN
#df = df.replace(r'^\s*$', np.nan, regex=True)
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
    index  Data  sum
0       0   1.0  NaN
1       1   1.0  2.0
2       2   NaN  NaN
3       3   NaN  NaN
4       4   1.0  NaN
5       5   1.0  NaN
6       6   1.0  3.0
7       7   NaN  NaN
8       8   1.0  NaN
9       9   1.0  2.0
10     10   NaN  NaN
11     11   1.0  NaN
12     12   1.0  NaN
13     13   1.0  NaN
14     14   1.0  NaN
15     15   1.0  5.0
16     16   NaN  NaN
17     17   1.0  1.0
18     18   NaN  NaN
19     19   1.0  1.0
20     20   NaN  NaN
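One edge worth noting (my observation, not part of the answer): `s.shift(-1)` is NaN on the last row, so if the frame ends with a data row rather than a blank, the final block's sum gets masked away. Passing `fill_value=True` to `shift` treats the end of the frame as the end of a block:

```python
import numpy as np
import pandas as pd

# a frame whose last block is NOT followed by a blank row (assumed variant of the sample)
df = pd.DataFrame({'Data': [1, 1, np.nan, 1, 1, 1]})

s = df['Data'].isnull()
# fill_value=True marks the end of the frame as the end of a block,
# so the trailing block's sum (3.0 on the last row) is not dropped
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where(
    (~s) & s.shift(-1, fill_value=True))
```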

Method 2

#df = df.drop(columns='index')  # if necessary
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))

Method 3

Using Series.duplicated and Series.mask:

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)

As you can see, the methods differ only in how they mask the values we don't need in the sum column.

We can also use .transform('sum') instead of .cumsum().
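A sketch of that variant (my addition, not from the answer): `transform('sum')` broadcasts each block's total to every row of the block, including the null rows, where it yields 0.0 rather than NaN. So on top of the duplicated-rows mask, the null rows themselves also have to be masked:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1,
                            np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]})

blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
# transform('sum') puts the block total on every row of the block;
# mask the duplicates AND the null rows (transform returns 0.0 there, not NaN)
df['sum'] = df.groupby(blocks)['Data'].transform('sum').mask(m | df['Data'].isnull())
```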

Performance with the sample dataframe:

%%timeit
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
4.52 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
8.52 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
3.02 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 5

DYZ

Reputation: 57135

Create a new column that is equal to the index in the data gaps and undefined otherwise:

df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)

Fill the column backward, count the lengths of the identically labeled spans, redefine the column:

df.Sum = df.groupby(df.Sum.bfill()).Data.count()

Align the new column with the original data:

df.Sum = df.Sum.shift(-1)

Eliminate 0-length spans:

df.loc[df.Sum == 0, 'Sum'] = np.nan
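The steps above, assembled into one runnable sketch on the sample frame (the count is taken on the `Data` column so it comes back as a Series indexed by the gap rows and aligns on assignment; casting the backfilled keys to int is my addition, to keep that alignment exact):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1,
                            np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]})

# mark each gap row with its own index, data rows with NaN
df['Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
# backfill: every row now carries the index of the gap that closes its span
spans = df.Sum.bfill().astype(int)
# count the data rows in each span; the result is indexed by the gap rows,
# so assigning it back places each count on its gap row
df['Sum'] = df.groupby(spans)['Data'].count()
# move each count from the gap row onto the last data row of its span
df['Sum'] = df.Sum.shift(-1)
# zero-length spans (consecutive gaps) leave a 0 behind; blank them out
df.loc[df.Sum == 0, 'Sum'] = np.nan
```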

Upvotes: 0

Lovesh Dongre

Reputation: 1344

Code used for replication:

import numpy as np
import pandas as pd

data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan,
                 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(data)

Iterative Approach Solution

count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:           # inside a block: count the rows (Data is all 1s)
        count += 1
    elif count != 0:                 # a blank row ends the block
        df.at[i - 1, 'Sum'] = count  # write the total on the block's last row
        count = 0
if count != 0:                       # flush a block that runs to the end of the frame
    df.at[df.shape[0] - 1, 'Sum'] = count

Upvotes: 1
