Reputation: 1383
I need to group a pandas Dataframe according to the following criteria, it is like-ish a ohlc aggregation:
open = last where volume > 0, in case there is no entry with volume > 0 use overall last
high = max
low = min
last = last
volume = max
My current implementation for these kinds of operations (ohlc aggregation) is:
ohlc_dict = {
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum',
}
df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).agg(ohlc_dict)
How can I solve this? Thank you.
Sample:
fi ts open high low close volume
datetime
2017-11-17 12:35:00 0 0 0.96214 0.96214 0.96214 0.96214 0
2017-11-17 12:35:00 0 0 0.96214 0.96214 0.96214 0.96214 0
2017-11-17 12:35:00 0 0 0.96214 0.96220 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96214 0.96220 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96214 0.96220 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96213 0.96220 0.96213 0.96219 19
2017-11-17 12:35:00 0 0 0.96214 0.96220 0.96214 0.96219 0
2017-11-17 12:35:00 0 0 0.96214 0.96222 0.96214 0.96222 0
2017-11-17 12:35:00 0 0 0.96214 0.96222 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96214 0.96222 0.96214 0.96221 0
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96223 0
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96221 0
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96220 0
2017-11-17 12:35:00 0 0 0.96213 0.96223 0.96213 0.96220 29
2017-11-17 12:35:00 0 0 0.96213 0.96223 0.96213 0.96220 29
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96221 0
2017-11-17 12:35:00 0 0 0.96214 0.96223 0.96214 0.96222 0
Desired Output:
fi ts open high low close volume
datetime
2017-11-17 12:35:00 0 0 0.96213 0.96223 0.96213 0.96222 29
Additional Information:
There are two data sources which can be identified by their 'Volume' values:
a. Volume = 0 (more frequent, less reliable)
b. Volume > 0 (less frequent, more reliable)
As type 'b.' is more reliable, it is preferrable to use its open value to type 'a' open value.
As to whether last aggregation, it doesn't really matter to be honest, other aggregations (first,max,min) would work, as the open value is the first quoted value in a minute (on this example) and never changes.
The issues of incorrect values arise when there is an interruption on the connection to the server. Type 'a' data cannot deal with this and will give me possibly wrong values, type 'b' data can deal with this just fine and will give me correct values.
Upvotes: 1
Views: 83
Reputation: 862641
You can first aggregate by last
of open
column:
ohlc_dict = {
'high': 'max',
'low': 'min',
'close': 'last',
'open':'last',
'volume':'sum'
}
g = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df2 = g.agg(ohlc_dict)
print (df2)
low close high open volume
datetime
2017-11-17 12:35:00 0.96213 0.96222 0.96223 0.96215 77
Then filter out all 0
volumes and aggregate only last value of open
:
g1 = df[df['volume'] > 0].groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df1 = g1['open'].last().reindex(df2.index)
print (df1)
datetime
2017-11-17 12:35:00 0.96213
Freq: T, Name: open, dtype: float64
Last combine both DataFrames to one with to_frame
and combine_first
:
df3 = df1.to_frame().combine_first(df2)
print (df3)
close high low open volume
datetime
2017-11-17 12:35:00 0.96222 0.96223 0.96213 0.96213 77.0
Use custom function with condition (slowier):
def ohlc_func(x):
a = x.loc[x['volume'] > 0, 'open'].tail(1)
a = a.item() if len(a) == 1 else x['open'].tail(1)[0]
b = x['high'].max()
c = x['low'].min()
d = x['close'].tail(1)[0]
e = x['volume'].sum()
col = ['open','high','low','close','volume']
return pd.Series([a,b,c,d,e], index=col)
df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).apply(ohlc_func)
print (df)
open high low close volume
datetime
2017-11-17 12:35:00 0.96213 0.96223 0.96213 0.96222 77.0
Upvotes: 2
Reputation: 30605
You can use boolean indexing
for max volume and tail(1)
for last value of open inside apply since you have duplicate index i.e
ohlc_dict = {
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'max',
}
grp = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
ndf = grp.agg(ohlc_dict)
ndf['open'] = grp['open','volume'].apply(lambda x : x[x['volume'] == x['volume'].max()].tail(1)['open'])
Output :
low volume close high open datetime 2017-11-17 12:35:00 0.96213 29 0.96222 0.96223 0.96213
Upvotes: 2