Reputation: 47
I'm trying to collapse redundant rows in a pandas DataFrame by aggregating on one column. The dataframe looks like this:
import pandas as pd

df1 = pd.DataFrame({'date': [20191121] * 10,
                    'time': [100000] * 10,
                    'last': [64131, 64131, 64130, 64130, 64130, 64131, 64131, 64132, 64130, 64130],
                    'vol': [1] * 10})
print(df1)
       date    time   last  vol
0  20191121  100000  64131    1
1  20191121  100000  64131    1
2  20191121  100000  64130    1
3  20191121  100000  64130    1
4  20191121  100000  64130    1
5  20191121  100000  64131    1
6  20191121  100000  64131    1
7  20191121  100000  64132    1
8  20191121  100000  64130    1
9  20191121  100000  64130    1
I would like to get a dataframe like this, where consecutive rows with the same last value are merged and their vol is summed:
df2=pd.DataFrame({'date':[20191121]*5,
'time':[100000]*5,
'last':[64131,64130,64131,64132,64130],
'vol':[2,3,2,1,2]})
print(df2)
       date    time   last  vol
0  20191121  100000  64131    2
1  20191121  100000  64130    3
2  20191121  100000  64131    2
3  20191121  100000  64132    1
4  20191121  100000  64130    2
Could you help me to solve this task?
Upvotes: 2
Views: 71
Reputation: 863031
You can aggregate with sum, but you also need to pass groupby a helper Series that labels each run of consecutive values of last:
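# helper Series: a new id starts whenever 'last' differs from the previous row
g = df1['last'].ne(df1['last'].shift()).cumsum()
# group by the columns plus the run id; sort=False keeps the original row order
df = df1.groupby(['date','time','last', g], sort=False, as_index=False)['vol'].sum()
print(df)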
       date    time   last  vol
0  20191121  100000  64131    2
1  20191121  100000  64130    3
2  20191121  100000  64131    2
3  20191121  100000  64132    1
4  20191121  100000  64130    2
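To see why the helper Series separates the two runs of 64131, it can help to print the intermediate steps. This is only a sketch on the question's data; the names changed, run_id and steps are illustrative and not part of the answer's code:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': [20191121] * 10,
    'time': [100000] * 10,
    'last': [64131, 64131, 64130, 64130, 64130, 64131, 64131, 64132, 64130, 64130],
    'vol': [1] * 10,
})

# True wherever 'last' differs from the previous row. The first row is
# always True because shift() leaves NaN there.
changed = df1['last'].ne(df1['last'].shift())

# A running total of those flags gives every consecutive run its own id.
run_id = changed.cumsum()

# Side-by-side view of the value, the change flag, and the resulting run id.
steps = pd.DataFrame({'last': df1['last'], 'changed': changed, 'run_id': run_id})
print(steps)
```

The run ids come out as 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, so the first run of 64131 (id 1) and the second (id 3) land in different groups even though they share the same last value.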
If you want to work with consecutive values of all 3 columns:
c = ['date','time','last']
# a new run starts whenever any of the 3 columns differs from the previous row
g = df1[c].ne(df1[c].shift()).any(axis=1).cumsum()
df = df1.groupby(c + [g], sort=False, as_index=False)['vol'].sum()
print(df)
       date    time   last  vol
0  20191121  100000  64131    2
1  20191121  100000  64130    3
2  20191121  100000  64131    2
3  20191121  100000  64132    1
4  20191121  100000  64130    2
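On the question's data both variants give the same result, because date and time never change. A small sketch with hypothetical data (not from the question) shows where they can differ: last stays constant while time flips back and forth. Here the helper is renamed to run_id and dropped after reset_index(), which is just one assumed way to keep the helper out of the result:

```python
import pandas as pd

# Hypothetical data: 'last' never changes, but 'time' goes 100000 -> 100001 -> 100000.
df = pd.DataFrame({
    'date': [20191121] * 3,
    'time': [100000, 100001, 100000],
    'last': [64131] * 3,
    'vol': [1, 1, 1],
})
c = ['date', 'time', 'last']

# Variant 1: runs are defined by 'last' only, so all three rows share run id 1
# and rows 0 and 2 (same date/time/last) fall into the same group.
g_last = df['last'].ne(df['last'].shift()).cumsum().rename('run_id')
by_last = (df.groupby(c + [g_last], sort=False)['vol'].sum()
             .reset_index()
             .drop(columns='run_id'))

# Variant 2: a run ends whenever any of the three columns changes,
# so each row here is its own run.
g_all = df[c].ne(df[c].shift()).any(axis=1).cumsum().rename('run_id')
by_all = (df.groupby(c + [g_all], sort=False)['vol'].sum()
            .reset_index()
            .drop(columns='run_id'))

print(by_last)
print(by_all)
```

With this input the first variant returns two rows (vol 2 and 1), merging the non-adjacent rows at time 100000, while the second returns three rows with vol 1 each. Which behaviour is wanted depends on whether only truly adjacent rows should be merged.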
Upvotes: 5