Reputation: 13054
I am learning how to use the pandas Python library. I am attempting a problem that is probably not the easiest place to start, given that I have no prior experience with pandas or with any statistical language such as Stata.
Suppose I have a data set from a query about people's feelings toward pies and cakes. Most of the questions I have asked are of the form:
"Do you prefer pies over cakes?" or "Would you vote for a cherry pie for President of the United States in 2020?"
These lead to "Yes" or "No" answers.
Suppose 1000 people have responded, but they differ in ways that matter to my upcoming analysis, such as Gender, Eye-Color, and primary-hand-use (right/left/ambidextrous). Also suppose I have hundreds of these distinctions and that eventually I want to compare them all against the same question.
Now, from my cake-pie.DTA file I can run:
import pandas

frame = pandas.read_stata("cake-pie.DTA")
answers = ["Yes", "No"]
# Row keys are passed as a list of Series: first the question, then the eye color
pandas.crosstab([frame["Question_1"], frame["Eye_Color"]], answers, normalize="columns")
And this will give me the following:
col_0                    yes
col_1                     no
Question_1 eye_color
Yes        Blue       0.1500
           Hazel      0.050
           Brown      0.2100
           Green      0.050
No         Blue       0.2850
           Hazel      0.0000
           Brown      0.2450
           Green      0.010
However, the 1000 respondents are not split evenly across eye colors. Perhaps my population looks like:
Blue    435  (43.5%)
Hazel    50  (5.0%)
Brown   455  (45.5%)
Green    60  (6.0%)
The output I'd like is not an estimate of the probability of Green & Yes but rather the probability of Yes | Green (the probability of Yes given green eyes).
I realize I can manually divide by the subpopulation total to get that answer, but I'm not sure how to divide by the pandas Series that holds my eye-color table above so that it all happens in a single crosstab.
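To make the manual division concrete, here is a rough sketch in plain Python using only the numbers from the two tables above. The dictionary names are made up for illustration and are not part of my actual data set.

```python
# Joint proportions P(eye color and Yes), copied from the crosstab output above.
p_color_and_yes = {"Blue": 0.150, "Hazel": 0.050, "Brown": 0.210, "Green": 0.050}

# Share of each eye color among all 1000 respondents.
p_color = {"Blue": 0.435, "Hazel": 0.050, "Brown": 0.455, "Green": 0.060}

# P(Yes | color) = P(color and Yes) / P(color)
p_yes_given_color = {
    color: p_color_and_yes[color] / p_color[color] for color in p_color
}

for color, p in p_yes_given_color.items():
    print(f"{color}: {p:.4f}")
```

For green eyes this is 0.050 / 0.060, or roughly 0.83, which is the kind of number I want pandas to produce for every subgroup at once.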
Upvotes: 1
Views: 966
Reputation: 61967
Assuming your DataFrame looks like the image below, you can pivot it by unstacking and then divide each row by its row total.
df1 = df.unstack(0)
df1.div(df1.sum(1), axis=0)
eye_color
                 No       Yes
Blue       0.655172  0.344828
Brown      0.538462  0.461538
Green      0.166667  0.833333
Hazel      0.000000  1.000000
More explanation: unstack(0) pivots the outermost level of the index (the levels are zero-indexed starting from the left) into the columns, so you get one row per eye color and one column per answer.
.sum(1) sums each row. The default is to sum down the columns (axis=0). Then we have to be a little tricky and use .div with axis=0 so that the division aligns on the index values only.
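The steps above can be tried end to end with a minimal sketch. It assumes the joint proportions from the question are held in a Series with a two-level index named Question_1 and eye_color. Those names, and the use of a Series rather than a one-column DataFrame, are assumptions made for illustration.

```python
import pandas as pd

# Joint proportions copied from the crosstab output in the question.
# The level names are assumed; adjust them to match your own frame.
index = pd.MultiIndex.from_product(
    [["Yes", "No"], ["Blue", "Hazel", "Brown", "Green"]],
    names=["Question_1", "eye_color"],
)
joint = pd.Series(
    [0.150, 0.050, 0.210, 0.050, 0.285, 0.000, 0.245, 0.010],
    index=index,
)

# Move the outer index level (Question_1) into the columns:
# one row per eye color, one column per answer.
wide = joint.unstack(0)

# Row totals are the share of each eye color; dividing every row by its
# own total gives P(answer | eye color). axis=0 aligns on the row index.
conditional = wide.div(wide.sum(axis=1), axis=0)
print(conditional.round(6))
```

Each row of conditional should sum to 1, which is a quick way to check that the division was aligned on the row index rather than on the columns.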
Upvotes: 3