RoQuOTriX

Reputation: 3001

Aggregate row values in Dataframe under specific condition

I have a dataframe which has two columns. I want to build the sum of all successive rows where value of column 0 is -1.

My dataframe looks something like this:

 0   2
 1   3
-1   4
-1   7
 0   2
-1   0
-1   1
-1   3
 5   0

And the desired output should be this:

 0   2
 1   3
-1   11
 0   2
-1   4
 5   0

The values in the second column are all greater than or equal to zero, and the values in the first column are greater than or equal to -1, if that helps. My approach would be a loop where I build a second dataframe, push back every row whose first value is not -1, and accumulate whenever I find a -1, but I imagine this approach would not be efficient. Pseudo code:

import pandas as pd

total = 0                                # running sum of the current -1 block
found = False                            # currently inside a -1 block?
rows = []
for row in df.itertuples(index=False):
    if row[0] != -1:
        if found:                        # close the accumulated -1 block
            rows.append([-1, total])
            total = 0
            found = False
        rows.append([row[0], row[1]])
    else:                                # row[0] == -1
        found = True
        total += row[1]
if found:                                # flush a trailing -1 block
    rows.append([-1, total])
new_df = pd.DataFrame(rows)

Is there a built-in Python or pandas function I can use to achieve my goal?

Upvotes: 2

Views: 731

Answers (1)

jezrael

Reputation: 862761

In my opinion it is necessary to create a helper Series here, to be 100% sure the group values for the -1 rows are never the same as the values used for the rows outside -1, so 0.5 is added to the index to distinguish them:

df = df.reset_index(drop=True)

m = df['a'] == -1
s = m.ne(m.shift()).cumsum()[m].reindex(df.index).fillna(df.index.to_series().add(.5))
df = df.groupby(s).agg({'a':'first', 'b':'sum'}).reset_index(drop=True)
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
3 -1   4
4  0   2
5  5   0

Explanation:

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})
print (df)
   a  b
0  0  2
1  1  3
2 -1  4
3 -1  7
4  0  2
5 -1  0
6 -1  1
7 -1  3
8  5  0

First, create a default index if necessary, because unique index values are used in the solution:

df = df.reset_index(drop=True)

Then create consecutive groups for the -1 rows and the other values:

m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum()))
   a  b  groups
0  0  2       1
1  1  3       1
2 -1  4       2
3 -1  7       2
4  0  2       3
5 -1  0       4
6 -1  1       4
7 -1  3       4
8  5  0       5

Then keep only the -1 rows with boolean indexing (by mask m) and convert the non-matched values to NaN with Series.reindex:

m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum(),
                 filtered = m.ne(m.shift()).cumsum()[m].reindex(df.index)))

   a  b  groups  filtered
0  0  2       1       NaN
1  1  3       1       NaN
2 -1  4       2       2.0
3 -1  7       2       2.0
4  0  2       3       NaN
5 -1  0       4       4.0
6 -1  1       4       4.0
7 -1  3       4       4.0
8  5  0       5       NaN

Then replace the missing values with the index values plus 0.5, so the group numbers for the -1 rows never collide with the values filled in for the NaNs:

m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum(),
                 filtered = m.ne(m.shift()).cumsum()[m].reindex(df.index),
                 idx = df.index.to_series().add(.5),
                 groups1 = m.ne(m.shift()).cumsum()[m].reindex(df.index).fillna(df.index.to_series().add(.5))))

   a  b  groups  filtered  idx  groups1
0  0  2       1       NaN  0.5      0.5
1  1  3       1       NaN  1.5      1.5
2 -1  4       2       2.0  2.5      2.0
3 -1  7       2       2.0  3.5      2.0
4  0  2       3       NaN  4.5      4.5
5 -1  0       4       4.0  5.5      4.0
6 -1  1       4       4.0  6.5      4.0
7 -1  3       4       4.0  7.5      4.0
8  5  0       5       NaN  8.5      8.5

Then the helper Series is passed to groupby, column b is aggregated with sum and column a with first via GroupBy.agg, and finally the index is removed by DataFrame.reset_index with drop=True:

df = df.groupby(s).agg({'a':'first', 'b':'sum'}).reset_index(drop=True)
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
3 -1   4
4  0   2
5  5   0
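
The same helper also works when a block of -1 rows sits at the very end of the frame, which the groupby handles without any special casing. A minimal sketch with a small hypothetical frame df2:

import pandas as pd

df2 = pd.DataFrame({'a': [0, -1, -1], 'b': [2, 4, 7]})

m = df2['a'] == -1
s = m.ne(m.shift()).cumsum()[m].reindex(df2.index).fillna(df2.index.to_series().add(.5))
print (df2.groupby(s).agg({'a':'first', 'b':'sum'}).reset_index(drop=True))
   a   b
0  0   2
1 -1  11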

Another solution, simpler and also with better performance:

df = df.reset_index(drop=True)

m = df['a'] == -1
s = (df.reset_index()
       .groupby(m.ne(m.shift()).cumsum()[m])
       .agg({'index':'first', 'b':'sum'})
       .set_index('index')
       .assign(a = -1))

df = df[~m].append(s, sort=True).sort_index()
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
4  0   2
5 -1   4
8  5   0
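
Note that this second solution keeps the original index labels (0, 1, 2, 4, 5, 8). If a default index like in the first solution is preferred, a final reset_index with drop=True should give the same output as the first approach:

df = df[~m].append(s, sort=True).sort_index().reset_index(drop=True)
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
3  0   2
4 -1   4
5  5   0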

Explanation:

First, a default index is necessary:

df = df.reset_index(drop=True)

Then compare the column to -1 to get a boolean mask:

m = df['a'] == -1

Convert the index to a column with reset_index and no drop parameter:

print (df.reset_index())
   index  a  b
0      0  0  2
1      1  1  3
2      2 -1  4
3      3 -1  7
4      4  0  2
5      5 -1  0
6      6 -1  1
7      7 -1  3
8      8  5  0

Create consecutive groups with shift and cumsum, and filter by the mask to keep only the -1 groups:

print (m.ne(m.shift()).cumsum()[m])
2    2
3    2
5    4
6    4
7    4
Name: a, dtype: int32

Aggregate with first on the index column and with sum on the b column:

print (df.reset_index()
        .groupby(m.ne(m.shift()).cumsum()[m])
        .agg({'index':'first', 'b':'sum'}))

     index   b
a             
2.0      2  11
4.0      5   4

Convert the index column to the index with DataFrame.set_index:

print(df.reset_index()
      .groupby(m.ne(m.shift()).cumsum()[m])
      .agg({'index':'first', 'b':'sum'})
      .set_index('index'))
        b
index    
2      11
5       4

Add column a with the constant -1 using DataFrame.assign:

s = (df.reset_index()
      .groupby(m.ne(m.shift()).cumsum()[m])
      .agg({'index':'first', 'b':'sum'})
      .set_index('index')
      .assign(a = -1))
print (s)
        b  a
index       
2      11 -1
5       4 -1

Finally, filter out the -1 rows by boolean indexing with the inverted mask (~):

print (df[~m])
  a  b
0  0  2
1  1  3
4  0  2
8  5  0

Then append the new data to the original with DataFrame.append:

print (df[~m].append(s, sort=True))
   a   b
0  0   2
1  1   3
4  0   2
8  5   0
2 -1  11
5 -1   4

And finally, DataFrame.sort_index to restore the original ordering:

print (df[~m].append(s, sort=True).sort_index())
   a   b
0  0   2
1  1   3
2 -1  11
4  0   2
5 -1   4
8  5   0
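
One caveat: DataFrame.append is deprecated in newer pandas releases and was removed in pandas 2.0, so on a recent install the last step can be written with pd.concat instead. A minimal sketch, reusing df, m and s from above:

import pandas as pd

# pd.concat replaces the removed DataFrame.append; sort=True sorts the
# columns just as append(..., sort=True) did, sort_index restores row order
df = pd.concat([df[~m], s], sort=True).sort_index()
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
4  0   2
5 -1   4
8  5   0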

Upvotes: 1
