skan
skan

Reputation: 7720

Easier way to calculate conditioned proportions with R data.table?

Let's say we have this toy example:

library(data.table)
temp <- data.table(first=c("A", "A","A", "A","B","C","C"), 
   sec=c("X", "X","X", "Y","X",  "Z","Z"), stringsAsFactors = T))

first sec
  A   X
  A   X
  A   X
  A   Y
  B   X
  C   Z
  C   Z

I would like to get a third column stating the proportion of times that combination happens among the occurrences of the first column.

I got to do it with data.table in the following way:

temp[,N1:=.N,by=.(first, sec)]
temp[,N2:=.N,by=first]
temp[, prop := N1/N2]
temp[,c("N1","N2"):=NULL]


first sec prop
   A   X 0.75
   A   X 0.75
   A   X 0.75
   A   Y 0.25
   B   X 1.00
   C   Z 1.00
   C   Z 1.00

That means that A,X happens 3 times. A happens 4 times, thus AX happens the 75% of the time the first letter is A.

Couldn't that be done easier?

Kind of

temp[,.N(first,sec)/.N(first)]

Upvotes: 3

Views: 690

Answers (2)

mt1022
mt1022

Reputation: 17289

What about doing grouping inside each first-level group with ave:

temp[, prop := ave(as.numeric(sec), sec, FUN = length) / .N, by = .(first)]
# > temp
#    first sec prop
# 1:     A   X 0.75
# 2:     A   X 0.75
# 3:     A   X 0.75
# 4:     A   Y 0.25
# 5:     B   X 1.00
# 6:     C   Z 1.00
# 7:     C   Z 1.00

Inspired by: https://stackoverflow.com/a/32003058/3926543

Another way: chain commands:

temp[,N:=.N,by=.(first, sec)][, prop := N / .N, by = .(first)][, N := NULL]

when first and sec have many levels and temp has many more rows, chaining commands or the solution in OP's question will be faster than ave solution.

Upvotes: 5

lmo
lmo

Reputation: 38490

Here is an alternative. It is not simpler, in my mind, but sort of embodies the idea that you have. The idea is to count the separate values of sec within first using table, return the proportion and the corresponding value of sec, drop counts that are 0, and then join the returned data.table onto the original.

temp[temp[, {cnt=table(sec); .(sec=names(cnt), prop=c(cnt) / .N)}, by=first][prop > 0],
     on=c("first", "sec")]
   first sec prop
1:     A   X 0.75
2:     A   X 0.75
3:     A   X 0.75
4:     A   Y 0.25
5:     B   X 1.00
6:     C   Z 1.00
7:     C   Z 1.00

Upvotes: 2

Related Questions