Reputation: 776
Let's assume we have a DataFrame with some columns, and I need to find the conditional probability of A given B and C (which are columns of this DataFrame) simultaneously. How do I calculate that?
For one variable, that is, finding the conditional probability of A given B, this would be pretty straightforward: I can do a groupby() and then a value_counts() like this:
df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
However, this won't work if I select 2 columns like this:
df.groupby('A')[['B', 'C']]
because this is then no longer a SeriesGroupBy object but rather a DataFrameGroupBy object, and I can't apply the value_counts() function.
Edit
Example:
This is part of the DataFrame
This is the output if I want to find the conditional probability that a person survives given his traveling class:
Now, I want to find the conditional probability that a person survives given two variables, say his traveling class and sex.
Upvotes: 2
Views: 2318
Reputation: 18647
IIUC, just reverse your groupby pattern: group by the conditions and apply value_counts to "survived":
df.groupby(['pclass', 'sex'])['survived'].value_counts(normalize=True)
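To make this concrete, here is a minimal sketch on a small made-up frame (the data below is hypothetical and only stands in for the DataFrame in the question). It also rebuilds the same numbers by hand, as joint counts divided by the size of each conditioning group, to show what normalize=True is doing:

```python
import pandas as pd

# Hypothetical toy data; column names follow the question, values are invented.
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# P(survived | pclass, sex): relative frequency of each outcome per group.
cond_prob = (
    df.groupby(["pclass", "sex"])["survived"]
      .value_counts(normalize=True)
      .sort_index()
)

# Same quantity by hand: count of (pclass, sex, survived) divided by
# the count of (pclass, sex).
joint = df.groupby(["pclass", "sex", "survived"]).size()
group_size = joint.groupby(level=["pclass", "sex"]).transform("sum")
manual = (joint / group_size).sort_index()

print(cond_prob)
```

Note that outcomes never observed in a group (for example, a non-survivor among first-class females in this toy data) simply do not appear in the result rather than showing up with probability 0.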
And if you need the output as a DataFrame, use Series.reset_index:
df.groupby(['pclass', 'sex'])['survived'].value_counts(normalize=True).reset_index(name='prob')
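As a quick sanity check on the DataFrame form, the sketch below (again on invented data, not the real dataset) confirms that the probabilities within each (pclass, sex) group sum to 1, which is what you would expect from a conditional distribution:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 2, 2],
    "sex":      ["female", "male", "male", "female", "female", "male", "male"],
    "survived": [1, 0, 1, 1, 0, 0, 0],
})

# Long-format table with one row per observed (pclass, sex, survived) combination.
prob_df = (
    df.groupby(["pclass", "sex"])["survived"]
      .value_counts(normalize=True)
      .reset_index(name="prob")
)

# Within each conditioning group the probabilities should add up to 1.
totals = prob_df.groupby(["pclass", "sex"])["prob"].sum()

print(prob_df)
print(totals)
```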
Upvotes: 3