vlmercado
vlmercado

Reputation: 1888

Pandas dataframe.apply() misapplying values to dataframe columns

My code calls a function using dataframe.apply(). The function returns several values using a pandas.Series. However, the dataframe.apply() applies the values to the wrong columns.

The code below attempts to return dte, mark and iv. These values are printed out before the return statement to validate the values.

import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm


# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450                                 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400                            # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar')     # Load US Federal holiday calendar
USFedCal.rules.pop(7)                                   # Remove Veteran's Day
USFedCal.rules.pop(6)                                   # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday)    # Add Good Friday
cal = tradingCal()


def newtonRap(row):
    # Initialize variables
    dte, mark, iv = 0.0, 0.0, 0.0
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
       row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0         # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        #   minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        iv = sqrt(2 * pi / T) * mark / S        # Initialize IV (Brenner and Subrahmanyam 1988)
        vega = 0.0                              # Initialize vega
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-5:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    print 'DTE', dte, 'Mark', mark, 'newtRaphIV', iv
    return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})


if __name__ == "__main__":
    # sample  data
    col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
    df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
                       'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
                       'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
                       'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
                       'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
                       'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
                       'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
                       'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
                       'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
                       'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
                       'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
                       'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
                       'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
    df = df[col_order]


    df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
    print df[['DTE', 'Mark', 'newtRaphIV']]

When I print the dataframe columns for dte, mark and iv, the value of iv is applied to the mark column and the value of the mark is applied to the iv column.

See output below:

DTE 12.0 Mark 3.65 newtRaphIV 0.330446529117
DTE 12.0 Mark 2.38 newtRaphIV 0.297287843836
DTE 12.0 Mark 2.93 newtRaphIV 0.308354580411
DTE 12.0 Mark 2.77 newtRaphIV 0.287119199001
DTE 12.0 Mark 2.445 newtRaphIV 0.305461340472
DTE 12.0 Mark 3.175 newtRaphIV 0.272517270403
DTE 37.0 Mark 5.825 newtRaphIV 0.347642501561
DTE 37.0 Mark 6.2 newtRaphIV 0.368273860485
   DTE      Mark  newtRaphIV
0   12  0.330447       3.650
1   12  0.297288       2.380
2   12  0.308355       2.930
3   12  0.287119       2.770
4   12  0.305461       2.445
5   12  0.272517       3.175
6   37  0.347643       5.825
7   37  0.368274       6.200

This is not the behavior I expected. What's going on?

Upvotes: 3

Views: 279

Answers (1)

unutbu
unutbu

Reputation: 879093

df.apply(newtonRap, axis=1)

is a DataFrame with columns ['DTE', 'Mark', 'IV'], but the order of the columns is not guaranteed (see below for the reason why). So to fix the order of the DataFrame columns, you could fix the order of the Series index returned by newtonRap:

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

or fix the order of the columns after df.apply returns:

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

The first option is better since

df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

creates two intermediate DataFrames -- df.apply(newtonRap, axis=1) and df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']], whereas the first option creates the correct DataFrame from the get-go.


DataFrame assignment aligns on index but not on columns:

Note that assignment of the form

df[['C','E','D']] = other_df

aligns based on index but not on column names. So it does not matter what the column names of df.apply(newtonRap, axis=1) are. For instance, it would not help to change

return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

to

return pd.Series({'DTE': dte, 'Mark': mark, 'newtRaphIV': iv})

to make the column names of df.apply(newtonRap, axis=1) match those of df[['DTE', 'Mark', 'newtRaphIV']]. If it did, it would just be dumb luck that the order of the columns returned by df.apply(newtonRap, axis=1) happened to match the desired order. To corroborate this claim, consider the example

df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
#    C  D  E
# 2  0  1  2
# 1  3  4  5
# 0  6  7  8

df[['C','E','D']] = new
#    A  B  C  E  D
# 0  7  9  6  7  8
# 1  4  9  3  4  5
# 2  8  2  0  1  2

Notice that the index of new and df get aligned, but there is no alignment based on column labels.


Fixing the order of the DataFrame columns returned by apply:

Note that dict keys are unordered. In other words, when iterated over, dict keys may appear in any order. In fact, in Python3, dict.keys() may return the same keys in a different order each time the same code is run.

Because dict keys have an indeterminate order,

pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

is a Series whose index has an indeterminate order, and therefore df.apply(newtonRap, axis=1) is a DataFrame whose columns appear in an indeterminate order.

However, if you use

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

then the order of the Series index is fixed. And thus df.apply(newtonRap, axis=1) has a fixed column order, and then

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)

will work as desired.

Upvotes: 4

Related Questions