Reputation: 569
I have share prices for companies in a portfolio and my goal is to create new column df['Final_weights']
yet keep the sum of weights for each date and category the same for df['weights']
and df['final_weights']
.
I want to give 0 to weights for companies which have a share price in the bottom 30 percentile for a particular day against companies in the same category, and I want to give higher weight to companies which have a share price over 70% percentile for a particular day against companies in the same category.
I have a dataframe with multiple dates and companies with respective category:
Eg subset of df:
Date Category Company Price weight
1/1/2007 Automative Audi 1000 0.146
1/1/2007 Automative Alfa Romeo 400 0.143
1/1/2007 Automative Aston Martin500 0.002
1/1/2007 Automative Bentley 2000 0.025
1/1/2007 Automative Mercedes 3000 0.063
1/1/2007 Automative BMW 40 0.154
1/1/2007 Automative Volvo 3000 0.163
1/1/2007 Automative VW 200 0.003
1/1/2007 Technology Apple 400 0.120
1/1/2007 Technology Microsoft 5500 0.048
1/1/2007 Technology Google 230 0.069
1/1/2007 Technology Lenova 36 0.036
1/1/2007 Technology IBM 250 0.016
1/1/2007 Technology Sprint 231 0.013
OK now I have written some code which creates a new column giving the percentile rank of each company per date and per category. The code looks like this:
df['Pctile'] = df.Price.groupby([df.index, df.Category]).rank(pct='True')
Output:
Category Company Price weight Pctile
Date
1/1/2007 Automative Audi 1000 0.146 0.625000
1/1/2007 Automative Alfa Romeo 400 0.143 0.375000
1/1/2007 Automative Aston Martin 500 0.002 0.500000
1/1/2007 Automative Bentley 2000 0.025 0.750000
1/1/2007 Automative Mercedes 3000 0.063 0.937500
1/1/2007 Automative BMW 40 0.154 0.125000
1/1/2007 Automative Volvo 3000 0.163 0.937500
1/1/2007 Automative VW 200 0.003 0.250000
1/1/2007 Technology Apple 400 0.120 0.833333
1/1/2007 Technology Microsoft 5500 0.048 1.000000
1/1/2007 Technology Google 230 0.069 0.333333
1/1/2007 Technology Lenova 36 0.036 0.166667
1/1/2007 Technology IBM 250 0.016 0.666667
1/1/2007 Technology Sprint 231 0.013 0.500000
Now I want a final column called df['Final_weight']
.
All I want to do is for each date and category are these 3 things,
df['Pctile']
is <0.3
I want df['Final_weight'] = 0.
df['Pctile']
is >= 0.3
and <= 0.7
then
df['Final_weight'] = df['weight']
.df['PCtile'] >0.7
= (weight / sum of weights above
70%pctile) *(sum of weights above 70%pctile + sum of weights below
30%pctile)
Here is some sample output and example calculations:
1/1/2007
:1) sum of weights above 70%pctile = 0.251
2) sum of weights below 30%pctile = 0.157
Calculation for Bentley = 0.025 / 0.251 * (0.251 + 0.157) = 0.041
Calculation for Mercedes = 0.063 / 0.251 * (0.251 + 0.157) = 0.102
Calculation for Volvo = 0.163 / 0.251 * (0.251 + 0.157) = 0.265
Now the sums of weight and final_weight for Automative for 1/1/2007
are the same. They both sum to 0.699.
1/1/2007
:1) sum of weights above 70%pctile = 0.168
2) sum of weights below 30%pctile = 0.036
Calculation for Apple = 0.120 / 0.168 * (0.168 + 0.036) = 0.146
Calculation for Microsoft = 0.048 / 0.168 * (0.168 + 0.036) = 0.058
Now the sums of weight and final_weight for Technology for 1/1/2007
are the same. They both sum to 0.302. Also the sum of that date also remains 1.
Eg output:
Category Company Price weight Pctile Final_weight
Date
1/1/2007 Automative Audi 1000 0.146 0.625000 0.146
1/1/2007 Automative Alfa Romeo 400 0.143 0.375000 0.143
1/1/2007 Automative Aston Martin 500 0.002 0.500000 0.002
1/1/2007 Automative Bentley 2000 0.025 0.750000 0.041
1/1/2007 Automative Mercedes 3000 0.063 0.937500 0.102
1/1/2007 Automative BMW 40 0.154 0.125000 0.000
1/1/2007 Automative Volvo 3000 0.163 0.937500 0.265
1/1/2007 Automative VW 200 0.003 0.250000 0
1/1/2007 Technology Apple 400 0.120 0.833333 0.146
1/1/2007 Technology Microsoft 5500 0.048 1.000000 0.058
1/1/2007 Technology Google 230 0.069 0.333333 0.069
1/1/2007 Technology Lenova 36 0.036 0.166667 0.000
1/1/2007 Technology IBM 250 0.016 0.666667 0.016
1/1/2007 Technology Sprint 231 0.013 0.500000 0.013
My data is large, with many categories, dates, companies. Would love to see an efficient way of programming this. Thanks for the help.
Upvotes: 1
Views: 599
Reputation: 4744
Whilst I wish this was a groupby-of-a-groupby solution it isn't. It is a bit of a dirty hack. The reason I could not do the groupby solution is because, as far as I am aware, there is no way of selecting and passing columns into multiple argument functions with a groupby. Enough of what cannot be done...
Now I said it was hacky so give it a go on your data set. I dont know how fast it is on a large data set but do let me know.
import pandas as pd
#make a lazy example
date = ['1/1/2017']*10
category = ['car']*5 + ['tech']*5
company = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
price = [10, 300, 100, 400, 500, 230, 324, 543, 234, 124]
weight = [0.2, 0.1, 0.3, 0.2, 0.2, 0.15, 0.15, 0.4, 0.1, 0.2]
data = {'date': date, 'category': category, 'company': company, 'price': price, 'weight': weight}
df = pd.DataFrame(data)
#do you percentile thing
df['pctile'] = df.price.groupby([df.date, df.category]).rank(pct='True')
# define a function?
def seventy_thirty(df):
s = df.ix[df.pctile > 0.7, 'pctile']
s.ix[:] = 'upper'
l = df.ix[df.pctile < 0.3, 'pctile']
l.ix[:] = 'lower'
s = s.append(l)
return s
df['pctile_summary'] = seventy_thirty(df)
# created a dataframe with weights the we can merge make into another column
weighted = df.groupby(['date', 'category', 'pctile_summary']).sum().ix[:, ['weight']]
# add lowers onto uppers as we'll need them in final_weights
add_lower = weighted.ix[weighted.index.get_level_values('pctile_summary')=='lower', ['weight']].reset_index(level=2)
add_lower.pctile_summary = 'upper'
add_lower = add_lower.set_index('pctile_summary', append=True)
weighted = pd.merge(weighted, add_lower, how='left', left_index=True, right_index=True, suffixes=['', '_lower'])
# Now add all new columns and calculate the final_weight
df1 = pd.merge(df, weighted.reset_index(), how='left', on=['date', 'category', 'pctile_summary'], suffixes=['', '_sum'])
df1.ix[df1.pctile_summary=='lower', 'final_weight'] = 0
df1.ix[df1.pctile_summary.isnull(), 'final_weight'] = df1.weight
df1.ix[df1.pctile_summary=='upper', 'final_weight'] = (df1.weight / df1.weight_sum) * (df1.weight_sum + df1.weight_lower)
#finally tidy up (delete all that hardwork!)
df1 = df1.drop(['pctile_summary', 'weight_sum', 'weight_lower'], axis=1)
df1
Upvotes: 1