user3761555

Vectorized lookup of another row + calculated field

I have this DataFrame, "dfSummary":

from datetime import datetime
import pandas as pd

exchangeBalances = [['ETHBTC','binance',10], ['LTCBTC','binance',10], ['XRPBTC','binance',10], ['ETHBTC','bitfinex',10], ['LTCBTC','bitfinex',10], ['XRPBTC','bitfinex',10]]
bidOffers = [
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 15)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 15)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 15)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 15)],
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 30)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 30)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 30)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 30)], 
             ['ETHBTC','binance', 0.0035, 0.0351, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','binance',0.009,0.092, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','binance',0.000077, 0.000078, datetime(2018, 9, 1, 8, 45)], ['ETHBTC','bitfinex', 0.003522, 0.0353, datetime(2018, 9, 1, 8, 45)], ['LTCBTC','bitfinex',0.0093,0.095, datetime(2018, 9, 1, 8, 45)], ['XRPBTC','bitfinex',0.000083, 0.000085, datetime(2018, 9, 1, 8, 45)]
             ]
dfExchangeBalances = pd.DataFrame(exchangeBalances, columns=['symbol','exchange','balance'])
dfBidOffers = pd.DataFrame(bidOffers, columns=['symbol','exchange','bid', 'offer', 'created'])
dfBidOffers["spread"] = dfBidOffers["bid"] - dfBidOffers["offer"]
dfSummary = dfExchangeBalances.merge(dfBidOffers, how='left', on=['symbol','exchange'])

What I need to accomplish is to add a computed field to "dfSummary":

currentRow["spread"] - someOtherRow["spread"]

"someOtherRow" is a lookup based on "created": for example, the last row with the same {symbol, exchange} whose "created" is 30 minutes before that of "currentRow".

Clarification: the above example is a simplification of the real problem at hand. The intervals are not exactly 15 minutes. In fact I need to look up the corresponding record (same key = {symbol, exchange}) in the same DataFrame, but the first such record created at the start of the month, quarter, and year.

I'm trying to avoid manually looping over the DataFrame (e.g. iterrows) and want to use Pandas' built-in lookup instead (vectorization).

I'm considering DataFrame.lookup (see "Vectorized look-up of values in Pandas dataframe"), but I'm not sure how to use it in the context of a calculated field. Also, instead of a lookup against a different DataFrame, I want a lookup against the same DataFrame.
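
To make the desired computation concrete, here is roughly the naive loop I am trying to avoid (assuming, just for illustration, an exact 30-minute offset, which in reality I cannot rely on; the spreadChange column name is also just for illustration):

spreadChange = []
for _, row in dfSummary.iterrows():
    target = row['created'] - pd.Timedelta(minutes=30)
    match = dfSummary[(dfSummary['symbol'] == row['symbol']) &
                      (dfSummary['exchange'] == row['exchange']) &
                      (dfSummary['created'] == target)]
    spreadChange.append((row['spread'] - match['spread'].iloc[0]) if len(match) else float('nan'))
dfSummary['spreadChange'] = spreadChange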

Related reading: Vectorization (Pandas and Numpy vs looping).

Upvotes: 1

Views: 292

Answers (3)

user3761555

Vectorized! (Well... for the most part.)

The idea is to use "merge" (a self-join) as opposed to "DataFrame.lookup", which is meant for an entirely different kind of application (see Pandas DataFrame.lookup).

Extending from the original fix ...

STEP 1) ProfitLoss.py: have to_dict pre-calculate TM1, MonthStart, QuarterStart and YearStart, since it is going to be called anyway.

import datetime
import time
import math

from Util import ObjectUtil
from Util import DateTimeUtil

import pandas as pd
import numpy as np

class ProfitLoss(object):
    def set(self, field, val):
        setattr(self, field, val)

    def to_dict(self):
        result = ObjectUtil.objectPropertiesToDictionary(self)
        result["TM1"] = self.COB - datetime.timedelta(days=1)
        result["MonthStart"] = self.COB.replace(day=1)
        result["QuarterStart"] = DateTimeUtil.getQuarterStart(self.COB, self.COB.year)
        result["YearStart"] = datetime.datetime(self.COB.year, 1, 1)

        return result

    @staticmethod
    def from_dict(data):
        if data is None:
            return None

        # data is the output of DataFrame.to_dict(): {column: {row index: value}}.
        # Allocate one ProfitLoss per row, using the length of the first column.
        profitLosses = []
        for k, v in data.items():
            numPnl = len(v)
            for i in range(numPnl):
                pnl = ProfitLoss()
                profitLosses.append(pnl)
            break

        # Copy each column's values onto the corresponding ProfitLoss objects.
        for k, v in data.items():
            if k == "from_dict":
                break

            i = 0
            for val in v.values():
                if isinstance(val, pd.Timestamp):
                    val = datetime.datetime(val.year, val.month, val.day)

                # NaN never compares equal to itself, so "val == np.nan" was always
                # False; test with math.isnan instead and map NaN to None.
                if isinstance(val, float) and math.isnan(val):
                    val = None

                profitLosses[i].set(k, val)
                i += 1

        return profitLosses

STEP 2) merge (i.e. Self-join), instead of DataFrame.apply or DataFrame.lookup:

        pdPnl = pd.DataFrame.from_records([pnl.to_dict() for pnl in profitLosses])
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["TM1"], right_on=["COB"], suffixes = ('','_tm1'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["MonthStart"], right_on=["COB"], suffixes = ('','_MonthStart'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["QuarterStart"], right_on=["COB"], suffixes = ('','_QuaterStart'))
        pdPnl = pdPnl.merge(pdPnl, how='inner', left_on=["YearStart"], right_on=["COB"], suffixes = ('','_YearStart'))

        # Vectorized
        pdPnl["DTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_tm1"]
        pdPnl["DTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_tm1"]
        pdPnl["MTDRealizedPnl"] =  pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_MonthStart"]
        pdPnl["MTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_MonthStart"]
        pdPnl["YTDRealizedPnl"] = pdPnl["InceptionRealizedPnl"] - pdPnl["InceptionRealizedPnl_YearStart"]
        pdPnl["YTDUnrealizedPnl"] = pdPnl["InceptionUnrealizedPnl"] - pdPnl["InceptionUnrealizedPnl_YearStart"]

        # Not yet vectorized
        pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
        pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)

        pnlDict = pdPnl.to_dict()
        updatedProfitLosses = ProfitLoss.ProfitLoss.from_dict(pnlDict)
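
A possible refinement (an untested assumption of mine, not part of the original fix): merge against a slim right-hand frame, so the column count doesn't double with every self-join:

        base = pdPnl[["COB", "InceptionRealizedPnl", "InceptionUnrealizedPnl"]]
        pdPnl = pdPnl.merge(base, how='inner', left_on=["TM1"], right_on=["COB"], suffixes=('', '_tm1'))
        pdPnl = pdPnl.merge(base, how='inner', left_on=["MonthStart"], right_on=["COB"], suffixes=('', '_MonthStart'))
        pdPnl = pdPnl.merge(base, how='inner', left_on=["QuarterStart"], right_on=["COB"], suffixes=('', '_QuarterStart'))
        pdPnl = pdPnl.merge(base, how='inner', left_on=["YearStart"], right_on=["COB"], suffixes=('', '_YearStart'))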

Actually, I am not sure if the merge/self-join is any more performant than explicit loops. Also, I still haven't figured out what to do with the Sharpe ratio and max drawdown; Pandas' windowing functions don't seem to help...
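
For the record, here is a rough sketch of how those two might be done with expanding windows, keeping the same simplified definitions used in my other answer (mean over std of total DTD; minimum negative DTD realized PnL). Untested; it assumes one row per COB, sorted ascending:

# Untested sketch; assumes one row per COB, sorted ascending by COB.
pdPnl = pdPnl.sort_values("COB")
totalDTD = pdPnl["DTDRealizedPnl"] + pdPnl["DTDUnrealizedPnl"]

# Sharpe ratio (simplified, no risk-free rate): expanding mean over expanding std of total DTD.
pdPnl["SharpeRatio"] = totalDTD.expanding().mean() / totalDTD.expanding().std()

# Max drawdown (as defined here): running minimum of the negative DTD realized PnL.
pdPnl["MaxDrawDown"] = pdPnl["DTDRealizedPnl"].where(pdPnl["DTDRealizedPnl"] < 0).cummin()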

People?! Thanks!!

Upvotes: 0

user3761555

I got it; here's my real code (so I am not posting everything). This works, but I'm not sure it is implemented in the fastest way.

I am using DataFrame.apply. This is NOT the vectorized way, but it should be a lot faster than looping in Python. Can someone please kindly shed some light on how to rewrite the below in a completely vectorized manner?

Referencing this article: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

... I can't wrap my mind around rewriting this in a vectorized way, and given the nature of the lookup, I am beginning to feel that the below cannot be vectorized (glad if one of you can prove me wrong):

pdPnl = pd.DataFrame.from_records([ObjectUtil.objectPropertiesToDictionary(pnl) for pnl in profitLosses], columns=ObjectUtil.objectPropertiesToDictionary(profitLosses[0]).keys())
pdPnl["TM1"] = pdPnl.apply(lambda rw : rw["COB"] - timedelta(days=1) , axis=1)
pdPnl["MonthStart"] = pdPnl.apply(lambda rw : rw["COB"].replace(day=1), axis=1)
pdPnl["QuarterStart"] = pdPnl.apply(lambda rw : DateTimeUtil.getQuarterStart(rw["COB"], rw["COB"].year), axis=1)
pdPnl["YearStart"] = pdPnl.apply(lambda rw : datetime(rw["COB"].year, 1, 1), axis=1)
pdPnl["DTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["DTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["TM1"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["MTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["MTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["MonthStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)
pdPnl["YTDRealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionRealizedPnl"], "InceptionRealizedPnl"), axis=1)
pdPnl["YTDUnrealizedPnl"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeField(pdPnl, rw["YearStart"], rw["InceptionUnrealizedPnl"], "InceptionUnrealizedPnl"), axis=1)

pdPnl["SharpeRatio"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeSharpeRatio(pdPnl, rw["COB"]), axis=1)
pdPnl["MaxDrawDown"] = pdPnl.apply(lambda rw : PnlCalculatorBase.computeMaxDrawDown(pdPnl, rw["COB"]), axis=1)

pnlDict = pdPnl.to_dict()  # Then convert back to List of ProfitLoss (Slow...)
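
(One part that I think can be fully vectorized is the four date columns; a sketch, assuming "COB" is a datetime64 column and that getQuarterStart means the calendar quarter start:)

pdPnl["TM1"] = pdPnl["COB"] - pd.Timedelta(days=1)
pdPnl["MonthStart"] = pdPnl["COB"].dt.to_period("M").dt.to_timestamp()
pdPnl["QuarterStart"] = pdPnl["COB"].dt.to_period("Q").dt.to_timestamp()
pdPnl["YearStart"] = pdPnl["COB"].dt.to_period("Y").dt.to_timestamp()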

The lookup functions are:

@staticmethod
def lookUpRow(pdPnl, cob):
    return pdPnl[pdPnl["COB"]==cob]

@staticmethod
def computeField(pdPnl, cob, todaysPnl, targetField):
    val = np.nan
    otherRow = PnlCalculatorBase.lookUpRow(pdPnl, cob)
    if otherRow is not None and otherRow[targetField].shape[0]>0:
        try:
            tm1InceptionRealizedPnl = otherRow[targetField].iloc[0]
            val = todaysPnl - tm1InceptionRealizedPnl
        except Exception:
            # slow...
            errMsg = "Failed lookup for " + str(cob) + " " + targetField
            logging.error(errMsg)
            val = np.nan
    return val


@staticmethod
def computeSharpeRatio(pdPnl, cob):
    val = None
    pdPnl = pdPnl[(pdPnl['COB']<=cob)]
    pdPnl = pdPnl.loc[:,["COB", "DTDRealizedPnl","DTDUnrealizedPnl"]]
    pdPnl["TotalDTD"] = pdPnl.apply(lambda rw : rw["DTDRealizedPnl"] + rw["DTDUnrealizedPnl"], axis=1)

    # @todo We don't have a risk-free rate for the Sharpe ratio calc; this is just the average total DTD return over its standard deviation
    # https://en.wikipedia.org/wiki/Sharpe_ratio
    mean = pdPnl["TotalDTD"].mean()
    std = pdPnl["TotalDTD"].std()
    val = mean / std

    return val

@staticmethod
def computeMaxDrawDown(pdPnl, cob):
    val = None
    pdPnl = pdPnl[(pdPnl['COB']<=cob) & (pdPnl["DTDRealizedPnl"]<0)]
    val = pdPnl["DTDRealizedPnl"].min()
    return val

Upvotes: 1

Vishnu Kunchur

Reputation: 1726

This assumes created has constant 15-minute intervals. You can group by symbol and exchange and shift the spread down by 2 (two 15-minute periods = 30 minutes):

dfSummary['30min_ago_spread'] = dfSummary.groupby(['symbol', 'exchange'])['spread'].shift(2)

Output:

   symbol exchange  balance   offer  spread              created  30min_ago_spread
0  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:15:00               NaN
1  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:30:00               NaN
2  ETHBTC  binance       10  0.0351 -0.0316  2018-09-01 08:45:00           -0.0316
3  LTCBTC  binance       10  0.0920 -0.0830  2018-09-01 08:15:00               NaN
4  LTCBTC  binance       10  0.0920 -0.0830  2018-09-01 08:30:00               NaN
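
If created is not on a perfectly regular grid (the question's clarification suggests it isn't), one possible alternative is an as-of self-join with pd.merge_asof, which matches each row to the most recent earlier row per group. A sketch only; the lookupTime, spread_30minAgo and spreadChange names are just for illustration:

left = dfSummary.sort_values('created').copy()
left['lookupTime'] = left['created'] - pd.Timedelta(minutes=30)
past = dfSummary[['symbol', 'exchange', 'created', 'spread']].sort_values('created')

# For each row, take the last row of the same symbol/exchange created at or before 30 minutes earlier.
merged = pd.merge_asof(left.sort_values('lookupTime'), past,
                       left_on='lookupTime', right_on='created',
                       by=['symbol', 'exchange'],
                       direction='backward', suffixes=('', '_30minAgo'))
merged['spreadChange'] = merged['spread'] - merged['spread_30minAgo']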

Upvotes: 0
