iGada

Reputation: 633

Computing conditional probabilities for dummy variables in R

Consider the following data frame named mydata.

id  s1  s2  s3  t1  t2  t3
1   1   0   0   0   1   0
2   0   0   1   0   0   1
3   1   0   0   1   0   0
4   0   1   0   0   1   0
5   0   1   0   1   0   0
6   0   0   1   0   0   1
7   0   0   1   0   1   0
8   1   0   0   0   0   1
9   0   1   0   0   0   1
10  0   0   1   0   0   1

My intention is to get the conditional proportion of each t_i given each s_i. For example, the conditional proportion of t1 given s1 is computed as: (number of rows with s1==1 & t1==1)/(number of rows with s1==1) = 1/3. I want to repeat this for all possible combinations using a for loop in R.

Any help is highly appreciated. Thanks!

Upvotes: 0

Views: 184

Answers (2)

G. Grothendieck

Reputation: 269556

We show how to do this without looping by using matrix math. In a special case, which covers the sample input shown in the question, it can also be done using regression.

Get the s columns as a matrix mats and the t columns as a matrix matt. Then divide the cross product of the two by the column sums of mats, so that each cell is the proportion of rows with that t column equal to 1 among the rows with that s column equal to 1.

nms <- names(mydata)

is <- startsWith(nms, "s")
it <- startsWith(nms, "t")

mats <- as.matrix(mydata[is])
matt <- as.matrix(mydata[it])

crossprod(mats, matt) / colSums(mats)

giving:

          t1        t2        t3
s1 0.3333333 0.3333333 0.3333333
s2 0.3333333 0.3333333 0.3333333
s3 0.0000000 0.2500000 0.7500000

As a double check note that the s1/t1 cell in the above matrix is 1/3 as in the question.
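Since the question asks about a for loop, the same table can also be filled in cell by cell. The following is only a sketch for cross-checking the matrix result; the names scols, tcols, sc, tc and res are introduced here for illustration and do not appear in the question.

```r
# Reproducible copy of the question's data
Lines <- "
id  s1  s2  s3  t1  t2  t3
1   1   0   0   0   1   0
2   0   0   1   0   0   1
3   1   0   0   1   0   0
4   0   1   0   0   1   0
5   0   1   0   1   0   0
6   0   0   1   0   0   1
7   0   0   1   0   1   0
8   1   0   0   0   0   1
9   0   1   0   0   0   1
10  0   0   1   0   0   1"
mydata <- read.table(text = Lines, header = TRUE)

# Column names of the two groups of dummy variables
scols <- names(mydata)[startsWith(names(mydata), "s")]
tcols <- names(mydata)[startsWith(names(mydata), "t")]

# One row per s column, one column per t column
res <- matrix(NA_real_, nrow = length(scols), ncol = length(tcols),
              dimnames = list(scols, tcols))

for (sc in scols) {
  for (tc in tcols) {
    # (rows with s == 1 and t == 1) / (rows with s == 1)
    res[sc, tc] <- sum(mydata[[sc]] == 1 & mydata[[tc]] == 1) /
      sum(mydata[[sc]] == 1)
  }
}

res
```

The loop makes the definition explicit, while the crossprod version computes all cells in one step.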

Orthogonal mats

In the question there is exactly one 1 in each row of the s columns. If that holds in general (more precisely, we only need the columns of mats to be orthogonal), then the result can be obtained as the coefficients of the following regression:

coef( lm(cbind(t1, t2, t3) ~ s1 + s2 + s3 + 0, mydata))

giving:

             t1        t2        t3
s1 3.333333e-01 0.3333333 0.3333333
s2 3.333333e-01 0.3333333 0.3333333
s3 5.551115e-17 0.2500000 0.7500000

or equivalently (except for slightly different row names):

coef(lm(matt ~ mats + 0))

or

solve(crossprod(mats), crossprod(mats, matt))
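To see why the regression reproduces the proportions, it may help to look at crossprod(mats) directly: with orthogonal 0/1 columns it is diagonal with the column sums on the diagonal, so solving against it amounts to dividing each row by the corresponding column sum. The sketch below checks this on the question's data; the names XtX, direct and viaSolve are introduced here for illustration only.

```r
# Reproducible copy of the question's data
Lines <- "
id  s1  s2  s3  t1  t2  t3
1   1   0   0   0   1   0
2   0   0   1   0   0   1
3   1   0   0   1   0   0
4   0   1   0   0   1   0
5   0   1   0   1   0   0
6   0   0   1   0   0   1
7   0   0   1   0   1   0
8   1   0   0   0   0   1
9   0   1   0   0   0   1
10  0   0   1   0   0   1"
mydata <- read.table(text = Lines, header = TRUE)

mats <- as.matrix(mydata[c("s1", "s2", "s3")])
matt <- as.matrix(mydata[c("t1", "t2", "t3")])

# t(mats) %*% mats: off-diagonal entries count rows where two
# different s columns are both 1, which never happens here
XtX <- crossprod(mats)
XtX

# Direct proportions versus the normal-equations solution
direct   <- crossprod(mats, matt) / colSums(mats)
viaSolve <- solve(XtX, crossprod(mats, matt))

# Compare values only, ignoring any difference in attributes
all.equal(direct, viaSolve, check.attributes = FALSE)
```

If some row had a 1 in more than one s column, XtX would no longer be diagonal and the two results would generally differ.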

Note

The input mydata in reproducible form is assumed to be:

Lines <- "
id  s1  s2  s3  t1  t2  t3
1   1   0   0   0   1   0
2   0   0   1   0   0   1
3   1   0   0   1   0   0
4   0   1   0   0   1   0
5   0   1   0   1   0   0
6   0   0   1   0   0   1
7   0   0   1   0   1   0
8   1   0   0   0   0   1
9   0   1   0   0   0   1
10  0   0   1   0   0   1"
mydata <- read.table(text = Lines, header = TRUE)

Upvotes: 3

akrun

Reputation: 887108

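We could use Map, which pairs each t column with the corresponding s column (t1 with s1, t2 with s2, and so on). Note the sum() around x & y, which turns the row-wise indicator into a count, so that each element of the result is a single proportion:

Map(function(x, y) sum(x & y)/sum(y), mydata[startsWith(names(mydata), 't')],
          mydata[startsWith(names(mydata), 's')])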

Upvotes: 0
