BlackVegetable

Reputation: 13054

Cross Tab using conditional sub-populations

I am learning how to use the pandas Python library. I am attempting a problem that is probably not the easiest place to start, given that I have no prior experience with pandas or with any statistical language such as Stata.

Suppose I have a data set from a query about people's feelings toward pies and cakes. Most of the questions I have asked are of the form:

"Do you prefer pies over cakes?" or "Would you vote for a cherry pie for President of the United States in 2020?"

These lead to "Yes" or "No" answers.

Suppose 1000 people have responded, and they differ in ways that matter to my upcoming analysis, such as gender, eye color, and primary hand use (right/left/ambidextrous). Also suppose I have hundreds of these distinctions and that eventually I want to compare them all against the same question.

Now, from my cake-pie.DTA file I can run:

import pandas

frame = pandas.read_stata("cake-pie.DTA")
answers = ["Yes", "No"]
pandas.crosstab([frame["Question_1"], frame["Eye_Color"]], answers, normalize="columns")

And this will give me the following:

col_0                  yes
col_1                   no
Question_1 eye_color
Yes        Blue     0.1500
           Hazel    0.0500
           Brown    0.2100
           Green    0.0500
No         Blue     0.2850
           Hazel    0.0000
           Brown    0.2450
           Green    0.0100

However, my 1000 respondents are not split evenly across eye colors. Perhaps my population looks like:

Blue  435 (43.5%)
Hazel  50 (5.0%)
Brown 455 (45.5%)
Green  60 (6.0%)

The output I'd like is not an estimate of P(Green & Yes) but rather P(Yes | Green), the probability of Yes given green eyes.

I realize I can manually divide by the subpopulation total to get that answer, but I'm not sure how to divide by the pandas Series that holds my eye-color totals so that it all happens in a single cross tab.

Upvotes: 1

Views: 966

Answers (1)

Ted Petrou

Reputation: 61967

Assuming your DataFrame looks like the cross tab output in your question, you can pivot it by unstacking and then divide each row by its row total.

df1 = df.unstack(0)
df1.div(df1.sum(1), axis=0)

      eye_color          
             No       Yes
Blue   0.655172  0.344828
Brown  0.538462  0.461538
Green  0.166667  0.833333
Hazel  0.000000  1.000000
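
To try the two lines above end to end, here is a minimal sketch. It assumes the input is a Series of joint proportions indexed by (Question_1, eye_color); the values are typed in by hand from the question's output purely for illustration, whereas in practice they would come from pandas.crosstab.

```python
import pandas as pd

# Joint proportions copied from the question's cross tab output.
# Building the Series by hand is an assumption made for illustration.
joint = pd.Series(
    {
        ("Yes", "Blue"): 0.150,
        ("Yes", "Hazel"): 0.050,
        ("Yes", "Brown"): 0.210,
        ("Yes", "Green"): 0.050,
        ("No", "Blue"): 0.285,
        ("No", "Hazel"): 0.000,
        ("No", "Brown"): 0.245,
        ("No", "Green"): 0.010,
    }
).rename_axis(["Question_1", "eye_color"])

# Move the outer index level (Question_1) into the columns:
# one row per eye color, one column per answer.
df1 = joint.unstack(0)

# Divide every row by its own total, i.e. by the share of that eye color.
conditional = df1.div(df1.sum(axis=1), axis=0)
print(conditional.round(6))
```

Each row of the result should sum to 1, since it now holds the split of answers within one eye color.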

More explanation: unstack(0) pivots the outermost level of the index (levels are zero-indexed, starting from the left) into the columns, so you get a frame with one row per eye color and one column for each answer (No and Yes).

.sum(1) sums each row; the default (axis=0) is to sum down the columns. Then we have to be a little tricky and use .div with axis=0 so that the division aligns on the index values, dividing each row by its own total.
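
If the raw responses are still at hand, the same row-wise division can probably be done in a single cross tab by passing normalize="index" to pandas.crosstab. The sketch below is not the method used above but an alternative; the raw frame is hypothetical, rebuilt from the question's proportions times 1000 respondents, and the column names are assumed.

```python
import pandas as pd

# Hypothetical raw responses: counts are the question's proportions
# multiplied by 1000 respondents. Column names are assumed.
counts = {
    ("Yes", "Blue"): 150,
    ("Yes", "Hazel"): 50,
    ("Yes", "Brown"): 210,
    ("Yes", "Green"): 50,
    ("No", "Blue"): 285,
    ("No", "Hazel"): 0,
    ("No", "Brown"): 245,
    ("No", "Green"): 10,
}
frame = pd.DataFrame(
    [
        {"Question_1": answer, "Eye_Color": color}
        for (answer, color), n in counts.items()
        for _ in range(n)
    ]
)

# normalize="index" divides each row by its row total, so every cell is
# the share of that answer within one eye color.
conditional = pd.crosstab(
    frame["Eye_Color"], frame["Question_1"], normalize="index"
)
print(conditional.round(6))
```

Putting the eye color on the rows and the question on the columns is what makes the row totals equal the subpopulation sizes.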

Upvotes: 3
