Bendemann

Reputation: 776

Calculating conditional probability given two other variables

Suppose I have a DataFrame and need to find the conditional probability of A given B and C jointly, where A, B and C are columns of this DataFrame. How do I calculate that?

For a single variable, i.e. the conditional probability of A given B, this is pretty straightforward: I can do a groupby() followed by a value_counts(), like this:

df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

However, this won't work if I select 2 columns like this:

df.groupby('A')[['B', 'C']]

because the result is then no longer a SeriesGroupBy object but a DataFrameGroupBy object, and I can't apply the value_counts() function to it.

Edit

Example:

This is part of the DataFrame

[image: excerpt of the DataFrame]

This is the output if I want to find the conditional probability that a person survives given his traveling class:

[image: output showing the survival probabilities by traveling class]

Now, I want to find the conditional probability that a person survives given two variables, say his traveling class and sex.

Upvotes: 2

Views: 2318

Answers (1)

Chris Adams

Reputation: 18647

IIUC, just reverse your groupby pattern: group by the conditioning columns and apply value_counts to "survived":

df.groupby(['pclass', 'sex'])['survived'].value_counts(normalize=True)
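
To see what this returns, here is a minimal sketch on a small made-up DataFrame. The eight rows below are invented for illustration and are not the data from the question; only the column names follow it. Each value in the result is the relative frequency of a "survived" value within its (pclass, sex) group, so the values within each group should sum to 1.

```python
import pandas as pd

# Toy data invented for illustration; only the column names come from the question.
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# P(survived | pclass, sex): relative frequency of each "survived" value
# within each (pclass, sex) group.
probs = df.groupby(["pclass", "sex"])["survived"].value_counts(normalize=True)

# Sanity check: the probabilities inside every conditioning group add up to 1.
# Levels 0 and 1 of the result's index are pclass and sex.
group_totals = probs.groupby(level=[0, 1]).sum()

print(probs)
print(group_totals)
```

The result is a Series with a three-level index (pclass, sex, survived), so a single probability can be looked up with a tuple such as probs.loc[(1, "male", 0)].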

And if you need the output as a DataFrame, use Series.reset_index:

df.groupby(['pclass', 'sex'])['survived'].value_counts(normalize=True).reset_index(name='prob')
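
If only the probability of surviving is of interest, the flat table can be filtered afterwards. The sketch below again uses invented toy data, and it assumes "survived" is coded as 0/1. Under that assumption the group mean is a shortcut for P(survived = 1 | pclass, sex); that shortcut is an addition here, not part of the approach above.

```python
import pandas as pd

# Toy data invented for illustration; only the column names come from the question.
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Flat table with one row per observed (pclass, sex, survived) combination.
table = (
    df.groupby(["pclass", "sex"])["survived"]
      .value_counts(normalize=True)
      .reset_index(name="prob")
)

# Keep only the rows for survived == 1.
survival = table[table["survived"] == 1].reset_index(drop=True)

# Shortcut, assuming survived is coded 0/1: the mean of a 0/1 column
# is the share of ones in each group.
survival_mean = df.groupby(["pclass", "sex"])["survived"].mean()

print(survival)
print(survival_mean)
```

One difference to keep in mind: a group with no survivors has no survived == 1 row in the value_counts table, whereas the mean reports it explicitly as 0.0.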

Upvotes: 3
