Hamid K

Reputation: 1165

Pandas - Conditional Probability of a given specific b

I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?

df.groupby('a').groupby('b')

does not work. Let's assume I have 3 categories in column a, and for each one I have 5 categories of b. What I need is the total number of each class of b for each class of a. I tried the apply command, but I don't think I know how to use it properly.

df.groupby('a').apply(lambda x: x[x['b']] == '...').count()

Upvotes: 10

Views: 26042

Answers (6)

tom4everitt

Reputation: 591

The question is a little odd, in that it suggests that column B has categorical values. Typically, we compute (conditional) expectations on real-valued variables. In that case, it's actually much simpler:

df.groupby('A')['B'].mean()

For example, in the dataframe

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':[1, 1, 2, 3,2, 2, 1, 3], 'C':np.random.randn(8), 'D':np.random.randn(8)})

we get

A
bar 2.0
foo 1.8
Name: B, dtype: float64

Upvotes: 0

Okry Dokry

Reputation: 135

Answer:

This can be done with the pandas crosstab function. Given a DataFrame called 'df' with columns 'a' and 'b':

pd.crosstab(df.a, df.b, normalize='columns')

will return a DataFrame representing P(a | b).

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

Explanation:

Consider the DataFrame:

df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b':['1', '2', '3', '4','5', '1', '2', '3']})

Looking at columns a and b

df[["a", "b"]]

We have

    a   b
0   x   1
1   x   2
2   x   3
3   y   4
4   y   5
5   y   1
6   y   2
7   z   3

Then

pd.crosstab(df.a, df.b)

returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b

b   1   2   3   4   5
a                   
x   1   1   1   0   0
y   1   1   0   1   1
z   0   0   1   0   0

We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)

pd.crosstab(df.a, df.b, normalize='columns')

This normalizes each column, returning a DataFrame in which each column holds the conditional probabilities P(a | b=B) for a specific value B:

b    1   2   3   4   5
a
x   0.5 0.5 0.5 0.0 0.0
y   0.5 0.5 0.0 1.0 1.0
z   0.0 0.0 0.5 0.0 0.0

Notice, the columns sum to 1.

If we would instead prefer to get P(b | a), we could normalize over the rows

pd.crosstab(df.a, df.b, normalize='index')

To get

b         1         2         3     4     5
a
x  0.333333  0.333333  0.333333  0.00  0.00
y  0.250000  0.250000  0.000000  0.25  0.25
z  0.000000  0.000000  1.000000  0.00  0.00

Where the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice, the rows sum to 1.
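To pull out a single conditional probability rather than the whole table, the crosstab result can be indexed by label. This is a minimal sketch that reuses the example DataFrame from this answer; the variable names `p_a_given_b` and `p_b_given_a` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Columns sum to 1, so entry [a, b] is P(a | b)
p_a_given_b = pd.crosstab(df.a, df.b, normalize='columns')

# Rows sum to 1, so entry [a, b] is P(b | a)
p_b_given_a = pd.crosstab(df.a, df.b, normalize='index')

print(p_a_given_b.loc['x', '3'])  # P(a='x' | b='3') -> 0.5
print(p_b_given_a.loc['z', '3'])  # P(b='3' | a='z') -> 1.0
```

Note that the values in column b are strings here, so the column label is '3' rather than the integer 3.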

Upvotes: 11

maxymoo

Reputation: 36545

To count how often each class of b occurs within each class of a, you would do

df.groupby('a').b.value_counts()

For example, create a DataFrame as below:

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})

     A      B         C         D
0  foo    one -1.565185 -0.465763
1  bar    one  2.499516 -0.941229
2  foo    two -0.091160  0.689009
3  bar  three  1.358780 -0.062026
4  foo    two -0.800881 -0.341930
5  bar    two -0.236498  0.198686
6  foo    one -0.590498  0.281307
7  foo  three -1.423079  0.424715

Then:

df.groupby('A')['B'].value_counts()

A
bar  one      1
     two      1
     three    1
foo  one      2
     two      2
     three    1

To convert this to a conditional probability, you need to divide by the total size of each group.

You can either do it with another groupby:

df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64

Or you can apply a lambda function to the groups:

df.groupby('A')['B'].apply(lambda g: g.value_counts() / len(g))
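As a possible shortcut, `value_counts` on a grouped column accepts `normalize=True`, which should do the division by group size in one step. This is a sketch on the same example data (columns C and D are omitted because they are not needed); the name `p_b_given_a` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# normalize=True divides each count by the size of its 'A' group, giving P(B | A)
p_b_given_a = df.groupby('A')['B'].value_counts(normalize=True)

# The result has a MultiIndex (A, B), so a single probability is looked up with a tuple
print(p_b_given_a.loc[('foo', 'one')])  # P(B='one' | A='foo') -> 0.4
```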

Upvotes: 22

Hamid K

Reputation: 1165

Consider the DataFrame that Maxymoo suggested:

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'], 'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'], 'C':np.random.randn(8), 'D':np.random.randn(8)})

df
     A      B         C         D
0  foo    one  0.229206 -1.899999
1  bar    one  0.174972  0.328746
2  foo    two -1.384699 -1.691151
3  bar  three -1.008328 -0.915467
4  foo    two -0.065298 -0.107240
5  bar    two  1.871916  0.798135
6  foo    one  1.589609 -1.682237
7  foo  three  2.292783  0.639595

Let's assume we want to calculate the probability of y = foo given x = one, where x is column B and y is column A: P(y=foo|x=one) = ?

Approach 1:

df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B         
one    foo    0.666667
       bar    0.333333
three  foo    0.500000
       bar    0.500000
two    foo    0.666667
       bar    0.333333
dtype: float64

So the answer is: 0.6667

Approach 2:

Probability of x = one: 0.375

df['B'].value_counts()/df['B'].count()
one      0.375
two      0.375
three    0.250
dtype: float64

Probability of y = foo: 0.625

df['A'].value_counts()/df['A'].count()
foo    0.625
bar    0.375
dtype: float64

Probability of (x=one|y=foo): 0.4

df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A         
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64

Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
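The agreement between the two approaches can be checked numerically. The sketch below uses `pd.crosstab` for the conditional tables instead of the groupby division shown above (a swapped-in technique, chosen only to keep the lookups short); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

p_x = df['B'].value_counts(normalize=True)  # marginal P(x), x = column B
p_y = df['A'].value_counts(normalize=True)  # marginal P(y), y = column A

# Rows are values of A, columns are values of B
p_x_given_y = pd.crosstab(df['A'], df['B'], normalize='index')    # each row sums to 1
p_y_given_x = pd.crosstab(df['A'], df['B'], normalize='columns')  # each column sums to 1

# Approach 2: Bayes' rule, P(y|x) = P(x|y) * P(y) / P(x)
bayes = p_x_given_y.loc['foo', 'one'] * p_y['foo'] / p_x['one']

# Approach 1: read P(y=foo | x=one) directly from the table
direct = p_y_given_x.loc['foo', 'one']

print(bayes, direct)  # both are approximately 0.6667
```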

Upvotes: 1

Carlos H Zelada

Reputation: 11

You could try this function:

def conprob(pd1, pd2, transpose=1):
    # Contingency table of counts; by default rows come from pd2 and columns from pd1
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Divide each column by its total so that every column sums to 1: P(row | column)
    out = table / table.sum()
    # Append the marginal probability of each row value as an extra column 'p'
    out['p'] = table.sum(axis=1) / table.values.sum()
    return out

This returns the conditional probability P(row | column).

Upvotes: 1

cggarvey

Reputation: 585

You can pass in a list to groupby:

df.groupby(['a','b']).count()
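One caveat: `count()` counts non-null values in the remaining columns, so on a DataFrame that has only columns 'a' and 'b' it will probably return a result with no columns. `size()` counts rows per group instead. This is a sketch, borrowing the two-column example data from the crosstab answer, of turning those joint counts into P(b | a); the names `joint_counts` and `p_b_given_a` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Number of rows for every (a, b) pair that occurs in the data
joint_counts = df.groupby(['a', 'b']).size()

# Divide each (a, b) count by the total count of its 'a' group to get P(b | a)
p_b_given_a = joint_counts / joint_counts.groupby(level='a').transform('sum')

print(p_b_given_a)
```

Pairs that never occur in the data are simply absent from the result, rather than listed with probability 0 as in the crosstab output.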

Upvotes: 2
