I have this DataFrame "dfSummary" -
from datetime import datetime
import pandas as pd

exchangeBalances = [['ETHBTC','binance',10], ['LTCBTC','binance',10], ['XRPBTC','binance',10], ['ETHBTC','bitfinex',10], ['LTCBTC','bitfinex',10], ['XRPBTC','bitfinex',10]]
bidOffers = [
['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 15)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 15)],
['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 30)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 30)],
['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 45)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 45)]
]
dfExchangeBalances = pd.DataFrame(exchangeBalances, columns=['symbol','exchange','balance'])
dfBidOffers = pd.DataFrame(bidOffers, columns=['symbol','exchange','bid', 'offer', 'created'])
dfBidOffers["spread"] = dfBidOffers["bid"] - dfBidOffers["offer"]
dfSummary = dfExchangeBalances.merge(dfBidOffers, how='left', on=['symbol','exchange'])
What I need to accomplish is to add a computed field to "dfSummary":
currentRow["spread"] - someOtherRow["spread"]
"someOtherRow" is a lookup based on "created" (for example, the last row with the same {symbol, exchange} key but with "created" 30 minutes earlier than "currentRow").
Clarification: the example above is a simplification of the real problem at hand. The intervals are not exactly 15 minutes. In fact I need to look up the corresponding record (same key = {symbol, exchange}) in the DataFrame, but the first such record created in the current month, quarter, and year.
I'm trying to avoid manually looping over the DataFrame with iterrows and use Pandas' built-in lookups instead (vectorization).
I'm thinking of DataFrame.lookup (Vectorized look-up of values in Pandas dataframe), but I'm not sure how to use it in the context of a calculated field...? Also, instead of a lookup against a different DataFrame, I want a lookup against the same DataFrame.
Related: Vectorization (Pandas and Numpy vs looping)
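For the irregular-interval case, one possible approach (a sketch of my own, not code from this thread; the "lookup" helper column and "spreadChange" names are made up) is an as-of self-join: compute the target timestamp per row, then let pd.merge_asof find the last matching record per {symbol, exchange} at or before that time:

```python
from datetime import datetime, timedelta
import pandas as pd

bidOffers = [
    ['ETHBTC', 'binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 15)],
    ['ETHBTC', 'binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 30)],
    ['ETHBTC', 'binance', 0.0040, 0.0352, datetime(2018, 9, 1, 8, 45)],
]
df = pd.DataFrame(bidOffers, columns=['symbol', 'exchange', 'bid', 'offer', 'created'])
df['spread'] = df['bid'] - df['offer']

# Target timestamp to look up for each row: 30 minutes earlier
df['lookup'] = df['created'] - timedelta(minutes=30)

# merge_asof needs both sides sorted by the "on" keys
left = df.sort_values('lookup')
right = df[['symbol', 'exchange', 'created', 'spread']].sort_values('created')

merged = pd.merge_asof(
    left, right,
    left_on='lookup', right_on='created',
    by=['symbol', 'exchange'],          # match within the same {symbol, exchange}
    direction='backward',               # last record at or before the target time
    suffixes=('', '_30min_ago'),
)
merged['spreadChange'] = merged['spread'] - merged['spread_30min_ago']
```

Rows with no earlier record (here, the 8:15 and 8:30 rows) simply get NaN, which is usually what you want for the first observations.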
Upvotes: 1
Views: 292
Vectorized!!!!!!!! (well... for the most part)
The idea is to use "merge" (a self join), as opposed to "DataFrame.lookup", which is for an entirely different application: Pandas DataFrame.lookup
Extending from the original fix ...
STEP 1) In ProfitLoss.py, have to_dict pre-calculate TM1, MonthStart, QuarterStart and YearStart, since it is going to be called anyway.
import datetime
import time
import math
from Util import ObjectUtil
from Util import DateTimeUtil
import pandas as pd
import numpy as np
class ProfitLoss(object):
    def set(self, field, val):
        setattr(self, field, val)

    def to_dict(self):
        result = ObjectUtil.objectPropertiesToDictionary(self)
        result["TM1"] = self.COB - datetime.timedelta(days=1)
        result["MonthStart"] = self.COB.replace(day=1)
        result["QuarterStart"] = DateTimeUtil.getQuarterStart(self.COB, self.COB.year)
        result["YearStart"] = datetime.datetime(self.COB.year, 1, 1)
        return result

    @staticmethod
    def from_dict(dict):
        if dict is None:
            return None
        profitLosses = []
        # Create one empty ProfitLoss per row (any column gives the row count)
        for k, v in dict.items():
            numPnl = len(v)
            for i in range(0, numPnl):
                pnl = ProfitLoss()
                profitLosses.append(pnl)
            break
        for k, v in dict.items():
            if k == "from_dict":
                break
            i = 0
            for val in v.values():
                if isinstance(val, pd.Timestamp):
                    val = datetime.datetime(val.year, val.month, val.day)
                # Note: "val == np.nan" is always False (nan != nan), so test with math.isnan
                if isinstance(val, float) and math.isnan(val):
                    val = None
                profitLosses[i].set(k, val)
                i += 1
        return profitLosses
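DateTimeUtil.getQuarterStart isn't shown in this thread; for reference, a minimal equivalent (my own sketch, not the poster's implementation) could be:

```python
import datetime

# A minimal stand-in for DateTimeUtil.getQuarterStart: first day of the
# quarter that cob falls in, in the given year
def getQuarterStart(cob, year):
    quarterMonth = 3 * ((cob.month - 1) // 3) + 1   # 1, 4, 7 or 10
    return datetime.datetime(year, quarterMonth, 1)
```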
STEP 2) merge (i.e. a self-join), instead of DataFrame.apply or DataFrame.lookup:
pdPnl = pd.DataFrame.from_records([pnl.to_dict() for pnl in profitLosses])
# Join against the original frame each time (not the already-merged one, which
# would multiply the suffixed columns), and use a left join so rows without a
# match (e.g. the very first COB) are kept
base = pdPnl.copy()
pdPnl = pdPnl.merge(base, how='left', left_on='TM1', right_on='COB', suffixes=('', '_tm1'))
pdPnl = pdPnl.merge(base, how='left', left_on='MonthStart', right_on='COB', suffixes=('', '_MonthStart'))
pdPnl = pdPnl.merge(base, how='left', left_on='QuarterStart', right_on='COB', suffixes=('', '_QuarterStart'))
pdPnl = pdPnl.merge(base, how='left', left_on='YearStart', right_on='COB', suffixes=('', '_YearStart'))
# Vectorized
pdPnl["DTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_tm1"]
pdPnl["DTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_tm1"]
pdPnl["MTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_MonthStart"]
pdPnl["MTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_MonthStart"]
pdPnl["YTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_YearStart"]
pdPnl["YTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_YearStart"]
# Not yet vectorized
pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)
pnlDict = pdPnl.to_dict()
updatedProfitLosses = ProfitLoss.ProfitLoss.from_dict(pnlDict)
Actually, I am not sure if the merge/self-join is any more performant than explicit loops. Also, I still haven't figured out what to do with the Sharpe Ratio and MaxDrawDown!! Pandas' windowing functions don't seem to help...
People?! Thanks!!
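For what it's worth, the expanding-window API may cover both per-row calculations, given the computeSharpeRatio and computeMaxDrawDown definitions posted elsewhere in this thread (mean over std of total DTD PnL up to each row, and the running minimum of negative daily realized PnL). A sketch with made-up numbers, assuming the frame is sorted by COB:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for pdPnl, already sorted by COB
df = pd.DataFrame({
    'DTDRealizedPnl':   [100.0, -50.0,  30.0, -20.0],
    'DTDUnrealizedPnl': [ 10.0,   5.0, -15.0,  25.0],
})
total = df['DTDRealizedPnl'] + df['DTDUnrealizedPnl']

# Per-row ratio of mean to std over all rows up to and including this one,
# i.e. one computeSharpeRatio call per COB, done in a single pass
df['SharpeRatio'] = total.expanding().mean() / total.expanding().std()

# Running minimum of the negative daily realized PnL values seen so far,
# matching the computeMaxDrawDown definition in this thread
df['MaxDrawDown'] = df['DTDRealizedPnl'].where(df['DTDRealizedPnl'] < 0).expanding().min()
```

Note this reproduces the thread's drawdown definition (worst single negative day); a true peak-to-trough drawdown would instead work on the cumulative PnL series.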
Upvotes: 0
I got it; here's my real code (so I am not posting everything). This works (but I am not sure if it is implemented in the fastest way).
I am using DataFrame.apply. This is NOT the vectorized way, but it should be a lot faster than looping in Python. Can someone please kindly shed some light on how to rewrite the below in a completely vectorized manner?
Referencing this article - https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
... I can't wrap my mind around rewriting this in a vectorized way and, given the nature of the lookup, I am beginning to feel that the below cannot be vectorized (glad if one of you can prove me wrong):
pdPnl = pd.DataFrame.from_records([ObjectUtil.objectPropertiesToDictionary(pnl) for pnl in profitLosses], columns=ObjectUtil.objectPropertiesToDictionary(profitLosses[0]).keys())
pdPnl["TM1"] = pdPnl.apply(lambda rw : rw["COB"] - timedelta(days=1) , axis=1)
pdPnl["MonthStart"] = pdPnl.apply(lambda rw : rw["COB"].replace(day=1), axis=1)
pdPnl["QuarterStart"] = pdPnl.apply(lambda rw : DateTimeUtil.getQuarterStart(rw["COB"], rw["COB"].year), axis=1)
pdPnl["YearStart"] = pdPnl.apply(lambda rw : datetime(rw["COB"].year, 1, 1), axis=1)
pdPnl["DTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["DTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["MTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["MTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["YTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["YTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)
pnlDict = pdPnl.to_dict() # Then convert back to List of ProfitLoss (Slow...)
The lookup functions are:
@staticmethod
def lookUpRow(pdPnl, cob):
    return pdPnl[pdPnl["COB"] == cob]

@staticmethod
def computeField(pdPnl, cob, todaysPnl, targetField):
    val = np.nan
    otherRow = PnlCalculatorBase.lookUpRow(pdPnl, cob)
    # Boolean indexing always returns a DataFrame (possibly empty), never None
    if otherRow[targetField].shape[0] > 0:
        try:
            otherPnl = otherRow[targetField].iloc[0]
            val = todaysPnl - otherPnl
        except Exception:
            # slow...
            errMsg = "Failed lookup for " + str(cob) + " " + targetField
            logging.error(errMsg)
            val = np.nan
    return val

@staticmethod
def computeSharpeRatio(pdPnl, cob):
    pdPnl = pdPnl[pdPnl['COB'] <= cob]
    pdPnl = pdPnl.loc[:, ["COB", "DTDRealizedPnl", "DTDUnrealizedPnl"]].copy()
    # Plain column addition is already vectorized; no need for apply here
    pdPnl["TotalDTD"] = pdPnl["DTDRealizedPnl"] + pdPnl["DTDUnrealizedPnl"]
    # @todo We don't have a risk-free rate for the Sharpe Ratio calc; this is just
    # the average total DTD return over its standard deviation
    # https://en.wikipedia.org/wiki/Sharpe_ratio
    return pdPnl["TotalDTD"].mean() / pdPnl["TotalDTD"].std()

@staticmethod
def computeMaxDrawDown(pdPnl, cob):
    pdPnl = pdPnl[(pdPnl['COB'] <= cob) & (pdPnl["DTDRealizedPnl"] < 0)]
    return pdPnl["DTDRealizedPnl"].min()
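For reference, if the COB values are unique, each computeField column can also be produced in one vectorized pass with a lookup Series and Series.map, instead of one lookUpRow call per row. A sketch with toy data (my own column values, not the poster's real numbers):

```python
from datetime import datetime, timedelta
import pandas as pd

# Toy frame standing in for pdPnl; COB values are assumed unique
pdPnl = pd.DataFrame({
    'COB': [datetime(2018, 9, 1), datetime(2018, 9, 2), datetime(2018, 9, 3)],
    'InceptionRealizedPnl': [100.0, 130.0, 160.0],
})
pdPnl['TM1'] = pdPnl['COB'] - timedelta(days=1)

# A Series indexed by COB; Series.map then performs every per-row lookup
# that computeField does, in one vectorized pass (missing dates become NaN)
byCob = pdPnl.set_index('COB')['InceptionRealizedPnl']
pdPnl['DTDRealizedPnl'] = pdPnl['InceptionRealizedPnl'] - pdPnl['TM1'].map(byCob)
```

The same pattern works for MonthStart/QuarterStart/YearStart by mapping those columns through the same Series.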
Upvotes: 1
Reputation: 1726
This is assuming "created" has constant 15-minute intervals. You can groupby symbol and exchange, and shift down by 2 (for 2 periods, since each period is 15 minutes):
dfSummary['30min_ago_spread'] = dfSummary.groupby(['symbol', 'exchange'])['spread'].shift(2)
Output:
symbol exchange balance offer spread created 30min_ago_spread
0 ETHBTC binance 10 0.0351 -0.0316 2018-09-01 08:15:00 NaN
1 ETHBTC binance 10 0.0351 -0.0316 2018-09-01 08:30:00 NaN
2 ETHBTC binance 10 0.0351 -0.0316 2018-09-01 08:45:00 -0.0316
3 LTCBTC binance 10 0.0920 -0.0830 2018-09-01 08:15:00 NaN
4 LTCBTC binance 10 0.0920 -0.0830 2018-09-01 08:30:00 NaN
Upvotes: 0