Reputation: 1888
My code calls a function using dataframe.apply(). The function returns several values using a pandas.Series. However, the dataframe.apply() applies the values to the wrong columns.
The code below attempts to return dte, mark and iv. These values are printed out before the return statement to validate the values.
import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm
# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400 # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar') # Load US Federal holiday calendar
USFedCal.rules.pop(7) # Remove Veteran's Day
USFedCal.rules.pop(6) # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday) # Add Good Friday
cal = tradingCal()
def newtonRap(row):
# Initialize variables
dte, mark, iv = 0.0, 0.0, 0.0
if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
row['TimeStamp'] == row['Expiry']:
iv, vega = 0.0, 0.0 # Set iv and vega to zero if option contract is invalid or expired
else:
# dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
# minus USFederalHolidays minus constant of 1 for the TimeStamp date
dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
mark = (row['Bid'] + row['Ask']) / 2
cp = 1 if row['OptType'] == 'C' else -1
S = row['RootPrice']
K = row['Strike']
T = (dte * tradingMinutesDay) / tradingMinutesAnnum
iv = sqrt(2 * pi / T) * mark / S # Initialize IV (Brenner and Subrahmanyam 1988)
vega = 0.0 # Initialize vega
for i in range(1, 100):
d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
d2 = d1 - iv * sqrt(T)
vega = S * norm.pdf(d1) * sqrt(T)
model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
iv -= (model - mark) / vega
if abs(model - mark) < 1.0e-5:
break
if isnan(iv) or isnan(vega):
iv, vega = 0.0, 0.0
print 'DTE', dte, 'Mark', mark, 'newtRaphIV', iv
return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
if __name__ == "__main__":
# sample data
col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
df = df[col_order]
df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
print df[['DTE', 'Mark', 'newtRaphIV']]
When I print the dataframe columns for dte, mark and iv, the value of iv is applied to the mark column and the value of the mark is applied to the iv column.
See output below:
DTE 12.0 Mark 3.65 newtRaphIV 0.330446529117
DTE 12.0 Mark 2.38 newtRaphIV 0.297287843836
DTE 12.0 Mark 2.93 newtRaphIV 0.308354580411
DTE 12.0 Mark 2.77 newtRaphIV 0.287119199001
DTE 12.0 Mark 2.445 newtRaphIV 0.305461340472
DTE 12.0 Mark 3.175 newtRaphIV 0.272517270403
DTE 37.0 Mark 5.825 newtRaphIV 0.347642501561
DTE 37.0 Mark 6.2 newtRaphIV 0.368273860485
DTE Mark newtRaphIV
0 12 0.330447 3.650
1 12 0.297288 2.380
2 12 0.308355 2.930
3 12 0.287119 2.770
4 12 0.305461 2.445
5 12 0.272517 3.175
6 37 0.347643 5.825
7 37 0.368274 6.200
This is not the behavior I expected. What's going on?
Upvotes: 3
Views: 279
Reputation: 879093
df.apply(newtonRap, axis=1)
is a DataFrame with columns ['DTE', 'Mark', 'IV']
, but the order of the columns is not guaranteed (see below for the reason why). So to fix the order of the DataFrame columns, you could
fix the order of the Series index returned by newtonRap
:
return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
or fix the order of the columns after df.apply
returns:
df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]
The first option is better since
df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]
creates two intermediate DataFrames -- df.apply(newtonRap, axis=1)
and
df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]
, whereas the first option creates the correct DataFrame from the get-go.
DataFrame assignment aligns on index but not on columns:
Note that assignment of the form
df[['C','E','D']] = other_df
aligns based on index but not on column names. So it does not matter what the column names of df.apply(newtonRap, axis=1)
are. For instance, it would not help to change
return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
to
return pd.Series({'DTE': dte, 'Mark': mark, 'newtRaphIV': iv})
to make the column names of df.apply(newtonRap, axis=1)
match those of
df[['DTE', 'Mark', 'newtRaphIV']]
. If it did, it would just be dumb luck
that the order of the columns returned by df.apply(newtonRap, axis=1)
happened to match the desired order. To corroborate this claim, consider the example
df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
# C D E
# 2 0 1 2
# 1 3 4 5
# 0 6 7 8
df[['C','E','D']] = new
# A B C E D
# 0 7 9 6 7 8
# 1 4 9 3 4 5
# 2 8 2 0 1 2
Notice that the index of new
and df
get aligned, but there is no alignment based on column labels.
Fixing the order of the DataFrame columns returned by apply
:
Note that dict keys are unordered. In other words, when iterated over, dict keys may appear in any order. In fact, in Python3, dict.keys()
may return the same keys in a different order each time the same code is run.
Because dict keys have an indeterminate order,
pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})
is a Series whose index has an indeterminate order, and therefore df.apply(newtonRap, axis=1)
is a DataFrame whose columns appear in an indeterminate order.
However, if you use
return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])
then the order of the Series index is fixed. And thus df.apply(newtonRap, axis=1)
has a fixed column order, and then
df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
will work as desired.
Upvotes: 4