joaoavf
joaoavf

Reputation: 1383

Grouping By Using a Conditional Statement from a Different Column

I need to group a pandas Dataframe according to the following criteria, it is like-ish a ohlc aggregation:

open = last where volume > 0, in case there is no entry with volume > 0 use overall last
high = max
low = min
last = last
volume = max

My current implementation for these kinds of operations (ohlc aggregation) is:

ohlc_dict = {
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum',
}

df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).agg(ohlc_dict)

How can I solve this? Thank you.

Sample:

                     fi  ts     open     high      low    close  volume
datetime                                                               
2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96213  0.96220  0.96213  0.96219      19
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96219       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96222       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96223       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96222       0

Desired Output:

                     fi  ts     open     high      low    close  volume
datetime 
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96222       29

Additional Information:

There are two data sources which can be identified by their 'Volume' values:

a. Volume = 0 (more frequent, less reliable)
b. Volume > 0 (less frequent, more reliable)

As type 'b.' is more reliable, it is preferrable to use its open value to type 'a' open value.

As to whether last aggregation, it doesn't really matter to be honest, other aggregations (first,max,min) would work, as the open value is the first quoted value in a minute (on this example) and never changes.

The issues of incorrect values arise when there is an interruption on the connection to the server. Type 'a' data cannot deal with this and will give me possibly wrong values, type 'b' data can deal with this just fine and will give me correct values.

Upvotes: 1

Views: 83

Answers (2)

jezrael
jezrael

Reputation: 862641

You can first aggregate by last of opencolumn:

ohlc_dict = {
   'high': 'max',
   'low': 'min',
   'close': 'last',
   'open':'last',
   'volume':'sum'
}

g = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df2 = g.agg(ohlc_dict)
print (df2)
                         low    close     high     open  volume
datetime                                                       
2017-11-17 12:35:00  0.96213  0.96222  0.96223  0.96215      77

Then filter out all 0 volumes and aggregate only last value of open:

g1 = df[df['volume'] > 0].groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df1 = g1['open'].last().reindex(df2.index)
print (df1)
datetime
2017-11-17 12:35:00    0.96213
Freq: T, Name: open, dtype: float64

Last combine both DataFrames to one with to_frame and combine_first:

df3 = df1.to_frame().combine_first(df2)
print (df3)
                       close     high      low     open  volume
datetime                                                       
2017-11-17 12:35:00  0.96222  0.96223  0.96213  0.96213    77.0

Use custom function with condition (slowier):

def ohlc_func(x):
    a = x.loc[x['volume'] > 0, 'open'].tail(1)
    a = a.item() if len(a) == 1 else x['open'].tail(1)[0]
    b = x['high'].max()
    c = x['low'].min()
    d = x['close'].tail(1)[0]
    e = x['volume'].sum()        
    col = ['open','high','low','close','volume']
    return pd.Series([a,b,c,d,e], index=col)


df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).apply(ohlc_func)
print (df)
                        open     high      low    close  volume
datetime                                                       
2017-11-17 12:35:00  0.96213  0.96223  0.96213  0.96222    77.0

Upvotes: 2

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

You can use boolean indexing for max volume and tail(1) for last value of open inside apply since you have duplicate index i.e

ohlc_dict = {
   'high': 'max',
   'low': 'min',
   'close': 'last',
   'volume': 'max',
}
grp = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
ndf = grp.agg(ohlc_dict)

ndf['open'] = grp['open','volume'].apply(lambda x : x[x['volume'] == x['volume'].max()].tail(1)['open'])

Output :

                        low  volume    close     high     open
datetime                                                       
2017-11-17 12:35:00  0.96213      29  0.96222  0.96223  0.96213

Upvotes: 2

Related Questions