Reputation: 693
My df looks like this:
country id x y
AT 11 50 100
AT 12 NaN 90
AT 13 NaN 104
AT 22 40 50
AT 23 30 23
AT 61 40 88
AT 62 NaN 78
UK 11 40 34
UK 12 NaN 22
UK 13 NaN 70
What I need is the sum of the y column in the first row that is not NaN in x, grouped by the first number on the left of the column id. This separately for each country. At the end I just need to drop the NaN.
The result should be something like this:
country id x y
AT 11 50 294
AT 22 40 50
AT 23 30 23
AT 61 40 166
UK 11 40 126
Upvotes: 0
Views: 443
Reputation: 22503
Use groupby
, transform
and dropna
:
print (df.assign(y=df.groupby(df["x"].notnull().cumsum())["y"].transform('sum'))
.dropna(subset=["x"]))
country id x y
0 AT 11 50.0 294
3 AT 22 40.0 50
4 AT 23 30.0 23
5 AT 61 40.0 166
7 UK 11 40.0 126
Upvotes: 0
Reputation: 863651
You can aggregate by GroupBy.agg
by first
and sum
functions with helper Series
by compare non missing values by Series.notna
and cumulative sum by Series.cumsum
:
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
.agg({'id':'first', 'x':'first', 'y':'sum'})
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
country id x y
0 AT 11 50.0 294
1 AT 22 40.0 50
2 AT 23 30.0 23
3 AT 61 40.0 166
4 UK 11 40.0 126
If possible first value(s) of x
are misisng values add DataFrame.dropna
:
print (df)
country id x y
0 AT 11 NaN 100
1 AT 11 50.0 100
2 AT 12 NaN 90
3 AT 13 NaN 104
4 AT 22 40.0 50
5 AT 23 30.0 23
6 AT 61 40.0 88
7 AT 62 NaN 78
8 UK 11 40.0 34
9 UK 12 NaN 22
10 UK 13 NaN 70
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
.agg({'id':'first', 'x':'first', 'y':'sum'})
.reset_index(level=1, drop=True)
.reset_index()
.dropna(subset=['x']))
print (df1)
country id x y
1 AT 11 50.0 294
2 AT 22 40.0 50
3 AT 23 30.0 23
4 AT 61 40.0 166
5 UK 11 40.0 126
Upvotes: 2