Reputation: 7720
Let's say we have this toy example:
library(data.table)
temp <- data.table(first=c("A", "A","A", "A","B","C","C"),
sec=c("X", "X","X", "Y","X", "Z","Z"), stringsAsFactors = T))
first sec
A X
A X
A X
A Y
B X
C Z
C Z
I would like to get a third column stating the proportion of times that combination happens among the occurrences of the first column.
I got to do it with data.table in the following way:
temp[,N1:=.N,by=.(first, sec)]
temp[,N2:=.N,by=first]
temp[, prop := N1/N2]
temp[,c("N1","N2"):=NULL]
first sec prop
A X 0.75
A X 0.75
A X 0.75
A Y 0.25
B X 1.00
C Z 1.00
C Z 1.00
That means that A,X happens 3 times. A happens 4 times, thus AX happens the 75% of the time the first letter is A.
Couldn't that be done easier?
Kind of
temp[,.N(first,sec)/.N(first)]
Upvotes: 3
Views: 690
Reputation: 17289
What about doing grouping inside each first-level group with ave
:
temp[, prop := ave(as.numeric(sec), sec, FUN = length) / .N, by = .(first)]
# > temp
# first sec prop
# 1: A X 0.75
# 2: A X 0.75
# 3: A X 0.75
# 4: A Y 0.25
# 5: B X 1.00
# 6: C Z 1.00
# 7: C Z 1.00
Inspired by: https://stackoverflow.com/a/32003058/3926543
Another way: chain commands:
temp[,N:=.N,by=.(first, sec)][, prop := N / .N, by = .(first)][, N := NULL]
when first
and sec
have many levels and temp
has many more rows,
chaining commands or the solution in OP's question will be faster than ave
solution.
Upvotes: 5
Reputation: 38490
Here is an alternative. It is not simpler, in my mind, but sort of embodies the idea that you have. The idea is to count the separate values of sec within first using table
, return the proportion and the corresponding value of sec, drop counts that are 0, and then join the returned data.table onto the original.
temp[temp[, {cnt=table(sec); .(sec=names(cnt), prop=c(cnt) / .N)}, by=first][prop > 0],
on=c("first", "sec")]
first sec prop
1: A X 0.75
2: A X 0.75
3: A X 0.75
4: A Y 0.25
5: B X 1.00
6: C Z 1.00
7: C Z 1.00
Upvotes: 2