Reputation: 3001
I have a dataframe which has two columns. I want to sum the values of all consecutive rows where the value in column 0 is -1.
My dataframe looks something like this:
0 2
1 3
-1 4
-1 7
0 2
-1 0
-1 1
-1 3
5 0
And the desired output should be this:
0 2
1 3
-1 11
0 2
-1 4
5 0
The values in the second column are all greater than or equal to zero, and the values in the first column are greater than or equal to -1, if this helps. My approach would be a loop where I create a second dataframe, append every row whose first value is not -1, and accumulate whenever I find a -1, but I imagine this approach would not be efficient.
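Pseudo code:
new_df = []
total = 0
found = False
for row in dataframe.itertuples(index=False):
    if row[0] != -1:
        if found:
            # a run of -1 rows just ended: emit its sum first
            new_df.append([-1, total])
            total = 0
            found = False
        new_df.append(row)
    else:
        found = True
        total += row[1]
if found:
    # flush a run of -1 rows at the very end
    new_df.append([-1, total])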
Is there a built-in Python or pandas function I can use to achieve my goal?
Upvotes: 2
Views: 731
Reputation: 862761
In my opinion it is necessary to create a helper Series for grouping. To be 100% sure that the group labels of the -1 runs are never the same as the labels of the other rows, 0.5 is added to the index values to distinguish them:
df = df.reset_index(drop=True)
m = df['a'] == -1
s = m.ne(m.shift()).cumsum()[m].reindex(df.index).fillna(df.index.to_series().add(.5))
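# sort=False keeps the groups in order of first appearance, i.e. the original row order
df = df.groupby(s, sort=False).agg({'a':'first', 'b':'sum'}).reset_index(drop=True)
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
3  0   2
4 -1   4
5  5   0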
Explanation:
df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})
print (df)
a b
0 0 2
1 1 3
2 -1 4
3 -1 7
4 0 2
5 -1 0
6 -1 1
7 -1 3
8 5 0
First create a default index if necessary, because the solution relies on unique index values:
df = df.reset_index(drop=True)
Then create consecutive groups for -1 and for the other values:
m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum()))
a b groups
0 0 2 1
1 1 3 1
2 -1 4 2
3 -1 7 2
4 0 2 3
5 -1 0 4
6 -1 1 4
7 -1 3 4
8 5 0 5
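To make the grouping idiom easier to follow, here is a small sketch that splits m.ne(m.shift()).cumsum() into its intermediate steps. The names prev, changed and groups are only illustrative and are not part of the solution above.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})

m = df['a'] == -1           # True where column a is -1
prev = m.shift()            # the mask moved down one row; the first value is NaN
changed = m.ne(prev)        # True on every row that starts a new run
groups = changed.cumsum()   # running count of run starts, used as run id

print(pd.DataFrame({'m': m, 'changed': changed, 'groups': groups}))
```

Each row where the mask differs from the previous row starts a new run, so the cumulative sum gives every run of equal mask values its own id.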
Then keep only the groups of the -1 rows with boolean indexing (by mask m) and convert the other, unmatched values to NaNs with Series.reindex:
m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum(),
filtered = m.ne(m.shift()).cumsum()[m].reindex(df.index)))
a b groups filtered
0 0 2 1 NaN
1 1 3 1 NaN
2 -1 4 2 2.0
3 -1 7 2 2.0
4 0 2 3 NaN
5 -1 0 4 4.0
6 -1 1 4 4.0
7 -1 3 4 4.0
8 5 0 5 NaN
Then replace the missing values with the index values plus 0.5, so the group labels of the -1 runs never collide with the values that replace the NaNs:
m = df['a'] == -1
print (df.assign(groups = m.ne(m.shift()).cumsum(),
filtered = m.ne(m.shift()).cumsum()[m].reindex(df.index),
idx = df.index.to_series().add(.5),
groups1 = m.ne(m.shift()).cumsum()[m].reindex(df.index).fillna(df.index.to_series().add(.5))))
a b groups filtered idx groups1
0 0 2 1 NaN 0.5 0.5
1 1 3 1 NaN 1.5 1.5
2 -1 4 2 2.0 2.5 2.0
3 -1 7 2 2.0 3.5 2.0
4 0 2 3 NaN 4.5 4.5
5 -1 0 4 4.0 5.5 4.0
6 -1 1 4 4.0 6.5 4.0
7 -1 3 4 4.0 7.5 4.0
8 5 0 5 NaN 8.5 8.5
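A short sketch of why the 0.5 offset matters, using the same sample data. Filling the NaNs with the bare index (the s_plain variant, which is only a counterexample and not part of the solution) lets row 4 share the label 4 with the second run of -1 rows, so the two get merged. The names run_id, s_plain and s_offset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})

m = df['a'] == -1
# run ids (2.0 and 4.0) on the -1 rows, NaN everywhere else
run_id = m.ne(m.shift()).cumsum()[m].reindex(df.index)

# Counterexample: no offset, index 4 clashes with run id 4
s_plain = run_id.fillna(df.index.to_series())
# Offset version: run ids are whole numbers, index + 0.5 never is
s_offset = run_id.fillna(df.index.to_series().add(.5))

plain_sums = df.groupby(s_plain)['b'].sum()
offset_sums = df.groupby(s_offset, sort=False)['b'].sum()
print(plain_sums)
print(offset_sums)
```

With the offset, six separate groups remain; without it, only five, because row 4 is summed together with rows 5 to 7.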
Then the helper Series is passed to groupby, aggregating sum for the second column and first for the first column with GroupBy.agg; finally the index is removed by DataFrame.reset_index with drop=True:
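# sort=False keeps the groups in order of first appearance, i.e. the original row order
df = df.groupby(s, sort=False).agg({'a':'first', 'b':'sum'}).reset_index(drop=True)
print (df)
   a   b
0  0   2
1  1   3
2 -1  11
3  0   2
4 -1   4
5  5   0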
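For reuse, the steps above could be wrapped in a function. This is only a sketch: the function name sum_runs and its parameters key, value and marker are my own assumptions, and it passes sort=False to groupby so that the groups keep their order of first appearance, which is the row order asked for in the question.

```python
import pandas as pd

def sum_runs(df, key='a', value='b', marker=-1):
    """Collapse each run of consecutive `marker` rows into one row, summing `value`."""
    df = df.reset_index(drop=True)
    m = df[key] == marker
    # run id on marker rows, NaN elsewhere
    run_id = m.ne(m.shift()).cumsum()[m].reindex(df.index)
    # unique half-integer label for every non-marker row
    s = run_id.fillna(df.index.to_series().add(.5))
    return (df.groupby(s, sort=False)
              .agg({key: 'first', value: 'sum'})
              .reset_index(drop=True))

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})
result = sum_runs(df)
print(result)
```

The input frame is not modified, because reset_index returns a new object.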
Another solution, simpler and also with better performance:
df = df.reset_index(drop=True)
m = df['a'] == -1
s = (df.reset_index()
       .groupby(m.ne(m.shift()).cumsum()[m])
       .agg({'index':'first', 'b':'sum'})
       .set_index('index')
       .assign(a = -1))
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df[~m], s], sort=True).sort_index()
print (df)
a b
0 0 2
1 1 3
2 -1 11
4 0 2
5 -1 4
8 5 0
Explanation:
First, a default index is necessary:
df = df.reset_index(drop=True)
Then compare the column with -1 to get a boolean mask:
m = df['a'] == -1
Convert the index to a column with reset_index, without the drop parameter:
print (df.reset_index())
index a b
0 0 0 2
1 1 1 3
2 2 -1 4
3 3 -1 7
4 4 0 2
5 5 -1 0
6 6 -1 1
7 7 -1 3
8 8 5 0
Create consecutive groups with shift and cumsum, then filter by the mask to keep only the -1 groups:
print (m.ne(m.shift()).cumsum()[m])
2 2
3 2
5 4
6 4
7 4
Name: a, dtype: int32
Aggregate first for the index column and sum for the b column:
print (df.reset_index()
.groupby(m.ne(m.shift()).cumsum()[m])
.agg({'index':'first', 'b':'sum'}))
     index   b
a
2.0      2  11
4.0      5   4
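One detail that is easy to miss here: the grouping Series only covers the -1 rows, yet it is passed to groupby on the full frame. As far as I know, pandas aligns the Series to the frame's index and drops the rows whose key is NaN, so no explicit filtering is needed. A minimal sketch to check this on the sample data (the names run_id, by_partial and by_filtered are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})

m = df['a'] == -1
run_id = m.ne(m.shift()).cumsum()[m]   # labels for index 2, 3, 5, 6, 7 only

# Grouping the full frame by the partial Series ...
by_partial = df.groupby(run_id)['b'].sum()
# ... gives the same sums as filtering to the -1 rows first
by_filtered = df[m].groupby(run_id)['b'].sum()

print(by_partial)
print(by_filtered)
```

The group keys show up as floats (2.0, 4.0) in the first case because the alignment introduces NaNs for the rows that have no label.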
Convert the index column to the index with DataFrame.set_index:
print(df.reset_index()
.groupby(m.ne(m.shift()).cumsum()[m])
.agg({'index':'first', 'b':'sum'})
.set_index('index'))
        b
index
2      11
5       4
Add column a with the constant -1 using DataFrame.assign:
s = (df.reset_index()
.groupby(m.ne(m.shift()).cumsum()[m])
.agg({'index':'first', 'b':'sum'})
.set_index('index')
.assign(a = -1))
print (s)
        b  a
index
2      11 -1
5       4 -1
Finally, filter out the -1 rows by boolean indexing with the inverted mask (~):
print (df[~m])
a b
0 0 2
1 1 3
4 0 2
8 5 0
Then add the new data to the original with pd.concat (DataFrame.append was removed in pandas 2.0):
print (pd.concat([df[~m], s], sort=True))
   a   b
0  0   2
1  1   3
4  0   2
8  5   0
2 -1  11
5 -1   4
And finally DataFrame.sort_index restores the original ordering:
print (pd.concat([df[~m], s], sort=True).sort_index())
a b
0 0 2
1 1 3
2 -1 11
4 0 2
5 -1 4
8 5 0
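The second solution can be sketched as one self-contained function as well. The function name collapse_runs_concat and its parameters are assumptions for illustration. It uses pd.concat to combine the two parts and assumes the frame has no column that is already named index, since reset_index creates one.

```python
import pandas as pd

def collapse_runs_concat(df, key='a', value='b', marker=-1):
    """Sum `value` over each run of `marker` rows; keep the first index of each run."""
    df = df.reset_index(drop=True)
    m = df[key] == marker
    run_id = m.ne(m.shift()).cumsum()[m]
    s = (df.reset_index()                       # index becomes a column named 'index'
           .groupby(run_id)                     # rows without a run id are dropped
           .agg({'index': 'first', value: 'sum'})
           .set_index('index')
           .assign(**{key: marker}))
    out = pd.concat([df[~m], s], sort=True).sort_index()
    return out[[key, value]]                    # restore the original column order

df = pd.DataFrame({'a': [0, 1, -1, -1, 0, -1, -1, -1, 5],
                   'b': [2, 3, 4, 7, 2, 0, 1, 3, 0]})
out = collapse_runs_concat(df)
print(out)
```

The result keeps the original index labels (0, 1, 2, 4, 5, 8); add .reset_index(drop=True) if a fresh 0..n-1 index is preferred.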
Upvotes: 1