Reputation: 1165
I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?
df.groupby('a').groupby('b')
does not work. Let's assume I have 3 categories in column a, and for each specific one I have 5 categories of b. What I need is the total count of each class of b for each class of a. I tried the apply command, but I don't think I know how to use it properly.
df.groupby('a').apply(lambda x: x[x['b']] == '...').count()
Upvotes: 10
Views: 26042
Reputation: 591
The question is a little odd, in that it suggests that column B has categorical values. Typically, we compute (conditional) expectations on real-valued variables, in which case it's actually much simpler:
df.groupby('A')['B'].mean()
For example, in the dataframe
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':[1, 1, 2, 3,2, 2, 1, 3], 'C':np.random.randn(8), 'D':np.random.randn(8)})
we get
A
bar 2.0
foo 1.8
Name: B, dtype: float64
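For completeness, a runnable version of this example (the random columns C and D are omitted here since they don't affect the result):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': [1, 1, 2, 3, 2, 2, 1, 3],
})

# E[B | A]: the conditional expectation of B given each value of A
cond_mean = df.groupby('A')['B'].mean()
print(cond_mean)
# bar    2.0
# foo    1.8
```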
Upvotes: 0
Reputation: 135
This is possible using the pandas crosstab function. Given the description of the problem, where the DataFrame is called 'df' with columns 'a' and 'b',
pd.crosstab(df.a, df.b, normalize='columns')
will return a DataFrame representing P(a | b).
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Consider the DataFrame:
df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
'b':['1', '2', '3', '4','5', '1', '2', '3']})
Looking at columns a and b
df[["a", "b"]]
We have
a b
0 x 1
1 x 2
2 x 3
3 y 4
4 y 5
5 y 1
6 y 2
7 z 3
Then
pd.crosstab(df.a, df.b)
returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b
b 1 2 3 4 5
a
x 1 1 1 0 0
y 1 1 0 1 1
z 0 0 1 0 0
We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)
pd.crosstab(df.a, df.b, normalize='columns')
which normalizes over the columns; in our case it returns a DataFrame whose columns give the conditional probabilities P(a | b=B) for specific values of B:
b 1 2 3 4 5
a
x 0.5 0.5 0.5 0.0 0.0
y 0.5 0.5 0.0 1.0 1.0
z 0.0 0.0 0.5 0.0 0.0
Notice that the columns sum to 1.
If we would instead prefer P(b | a), we can normalize over the rows (pandas calls this 'index'):
pd.crosstab(df.a, df.b, normalize='index')
To get
b 1 2 3 4 5
a
x 0.333333 0.333333 0.333333 0.00 0.00
y 0.250000 0.250000 0.000000 0.25 0.25
z 0.000000 0.000000 1.000000 0.00 0.00
where the rows give the conditional probabilities P(b | a=A) for specific values of A. Notice that the rows sum to 1.
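Putting the pieces together, a minimal runnable sketch (note that `'index'` is the keyword pandas accepts for row-wise normalization):

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
    'b': ['1', '2', '3', '4', '5', '1', '2', '3'],
})

# P(a | b): normalize each column of the frequency table
p_a_given_b = pd.crosstab(df.a, df.b, normalize='columns')

# P(b | a): normalize each row
p_b_given_a = pd.crosstab(df.a, df.b, normalize='index')

print(p_a_given_b.sum(axis=0))  # every column sums to 1
print(p_b_given_a.sum(axis=1))  # every row sums to 1
```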
Upvotes: 11
Reputation: 36545
To find the total count of each class of b for each class of a, you would do:
df.groupby('a').b.value_counts()
For example, create a DataFrame as below:
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})
A B C D
0 foo one -1.565185 -0.465763
1 bar one 2.499516 -0.941229
2 foo two -0.091160 0.689009
3 bar three 1.358780 -0.062026
4 foo two -0.800881 -0.341930
5 bar two -0.236498 0.198686
6 foo one -0.590498 0.281307
7 foo three -1.423079 0.424715
Then:
df.groupby('A')['B'].value_counts()
A
bar one 1
two 1
three 1
foo one 2
two 2
three 1
To convert this to a conditional probability, you need to divide by the total size of each group.
You can either do it with another groupby:
df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Or you can apply a lambda function to the groups:
df.groupby('A')['B'].apply(lambda g: g.value_counts()/len(g))
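As an aside, the same normalization can be done in one step with `value_counts(normalize=True)`; a short sketch on the same example frame (random columns C and D omitted):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

# Relative frequency of each B value within each A group, i.e. P(B | A)
probs = df.groupby('A')['B'].value_counts(normalize=True)
print(probs)
# e.g. P(B='one' | A='foo') = 0.4
```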
Upvotes: 22
Reputation: 1165
Consider the DataFrame that Maxymoo suggested:
df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})
df
A B C D
0 foo one 0.229206 -1.899999
1 bar one 0.174972 0.328746
2 foo two -1.384699 -1.691151
3 bar three -1.008328 -0.915467
4 foo two -0.065298 -0.107240
5 bar two 1.871916 0.798135
6 foo one 1.589609 -1.682237
7 foo three 2.292783 0.639595
Let's assume we want to calculate the probability of (y = foo) given x = one: P(y=foo | x=one) = ?
Approach 1:
df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B
one foo 0.666667
bar 0.333333
three foo 0.500000
bar 0.500000
two foo 0.666667
bar 0.333333
dtype: float64
So the answer is: 0.6667
Approach 2:
Probability of x = one: 0.375
df['B'].value_counts()/df['B'].count()
one 0.375
two 0.375
three 0.250
dtype: float64
Probability of y = foo: 0.625
df['A'].value_counts()/df['A'].count()
foo 0.625
bar 0.375
dtype: float64
Probability of (x=one|y=foo): 0.4
df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
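The arithmetic above can be checked directly; the Bayes' rule estimate agrees with the direct conditional frequency (random columns C and D omitted since they play no role):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
})

p_one = (df['B'] == 'one').mean()                                   # P(x=one)         = 0.375
p_foo = (df['A'] == 'foo').mean()                                   # P(y=foo)         = 0.625
p_one_given_foo = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()   # P(x=one | y=foo) = 0.4

# Bayes' rule recovers the direct estimate of P(y=foo | x=one)
bayes = p_one_given_foo * p_foo / p_one
direct = (df.loc[df['B'] == 'one', 'A'] == 'foo').mean()
print(bayes, direct)  # both 0.6667 (= 2/3)
```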
Upvotes: 1
Reputation: 11
You could try this function:

def conprob(pd1, pd2, transpose=1):
    # Build the frequency table; with the default transpose=1 the rows
    # come from pd2 and the columns from pd1
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    cnames = table.columns.values
    # Normalize every column so it sums to 1: P(row | column)
    weights = 1 / table[cnames].sum()
    out = table * weights
    # Marginal probability of the row variable, appended as column 'p'
    table = table.transpose()
    cnames = table.columns.values
    p = table[cnames].sum() / table[cnames].sum().sum()
    out['p'] = p
    return out

This returns the conditional probability P(row | column), plus the row variable's marginal probability in the extra 'p' column.
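A self-contained usage sketch with the earlier example frame (the function is repeated here, with its indentation made explicit, so the example runs on its own):

```python
import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Frequency table: rows from pd2, columns from pd1 (reversed if transpose == 0)
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    cnames = table.columns.values
    # Normalize every column so it sums to 1: P(row | column)
    out = table * (1 / table[cnames].sum())
    # Marginal probability of the row variable, appended as column 'p'
    table = table.transpose()
    cnames = table.columns.values
    out['p'] = table[cnames].sum() / table[cnames].sum().sum()
    return out

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

res = conprob(df.a, df.b)   # rows: b values, columns: a values, i.e. P(b | a)
print(res)                  # each a-column sums to 1; 'p' holds the marginal P(b)
```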
Upvotes: 1