Reputation: 8594
I am trying to calculate the distribution of values in a pandas
pivot table. I am having trouble calculating the values and understanding the proper way to access multi-index series/columns.
Here is a sample dataframe:
person = ['Jack', 'John', 'Mark']
randoms = ['A', 'B', 'C']
df = pd.DataFrame([], columns=['id', 'person','state', 'choice', 'num'])
for i in range(0,500):
row = [i, random.choice(person), random.choice(randoms), random.choice(randoms), random.randrange(1, 250)]
append_df = pd.DataFrame([row], columns=['id', 'person','state', 'choice', 'num'])
df = df.append(append_df)
df.reset_index(drop=True, inplace=True)
pd.pivot_table(
data = df,
index = ['person','choice'],
columns = ['state'],
values='num',
aggfunc = ['sum', 'count'],
margins=True,
margins_name='total',
fill_value=0
)
The output of the pivot table looks something like this:
sum sum sum sum count count count count
state A B C total A B C total
person choice
Jack A 1519 1667 1460 4646 15 13 11 39
Jack B 2078 1641 3200 6919 17 12 28 57
Jack C 2166 1845 3575 7586 13 17 28 58
John A 3241 2028 1880 7149 26 20 18 64
John B 2467 2517 1200 6184 21 23 12 56
John C 1585 2481 2791 6857 16 19 23 58
Mark A 2320 2647 1858 6825 20 19 18 57
Mark B 2807 2809 3116 8732 21 24 23 68
Mark C 1953 1503 2558 6014 11 13 19 43
total 20136 19138 21638 60912 160 160 180 500
What I am trying to calculate is the % of the num value by column/index to get the distribution, like so [The calc would be sum A / sum total, sum B / sum total, etc]:
sum sum sum sum count count count count
state A B C total A B C total A_Pct B_Pct C_Pct total_pct
person choice
Jack A 1519 1667 1460 4646 15 13 11 39 0.33 0.36 0.31 1.00
Jack B 2078 1641 3200 6919 17 12 28 57 0.30 0.24 0.46 1.00
Jack C 2166 1845 3575 7586 13 17 28 58 0.29 0.24 0.47 1.00
John A 3241 2028 1880 7149 26 20 18 64 0.45 0.28 0.26 1.00
John B 2467 2517 1200 6184 21 23 12 56 0.40 0.41 0.19 1.00
John C 1585 2481 2791 6857 16 19 23 58 0.23 0.36 0.41 1.00
Mark A 2320 2647 1858 6825 20 19 18 57 0.34 0.39 0.27 1.00
Mark B 2807 2809 3116 8732 21 24 23 68 0.32 0.32 0.36 1.00
Mark C 1953 1503 2558 6014 11 13 19 43 0.32 0.25 0.43 1.00
total 20136 19138 21638 60912 160 160 180 500 0.33 0.31 0.36 1.00
What's the best way to access the multi-index to calculate this? Is there a standard way to call indexes by name rather than levels?
Upvotes: 0
Views: 83
Reputation: 120459
You can use:
# piv = pd.pivot_table(...
pct = piv['sum'].iloc[:, :-1].div(piv['sum'].iloc[:, -1], axis=0)
piv = piv.join(pd.concat([pct, pct.sum(1).rename('total')],
keys=['pct', 'pct'], axis=1))
Output:
>>> piv
sum count pct
state A B C total A B C total A B C total
person choice
Jack A 2084 1836 2453 6373 19 16 20 55 0.327005 0.288090 0.384905 1.0
B 2875 1491 1925 6291 24 12 18 54 0.457002 0.237005 0.305993 1.0
C 2208 3513 2084 7805 19 28 18 65 0.282896 0.450096 0.267008 1.0
John A 2383 2755 1554 6692 19 21 11 51 0.356097 0.411686 0.232218 1.0
B 1779 1828 2790 6397 18 16 19 53 0.278099 0.285759 0.436142 1.0
C 1706 1841 3311 6858 12 16 24 52 0.248761 0.268446 0.482794 1.0
Mark A 1633 2173 1117 4923 20 18 14 52 0.331708 0.441398 0.226894 1.0
B 2099 2533 2253 6885 18 23 21 62 0.304866 0.367901 0.327233 1.0
C 2445 1925 2652 7022 20 14 22 56 0.348191 0.274138 0.377670 1.0
total 19212 19895 20139 59246 169 164 167 500 0.324275 0.335803 0.339922 1.0
Upvotes: 1