Reputation: 91
I need to understand why sometimes you need to use a "by", and sometimes you don't. I'm really new to both R and data.table, so it is probably something basic.
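library(data.table)
# Example data: every value of a is unique, so each row is its own group
a <- c("A", "B", "C")
b <- c("AA", "BBB", "CCC")
x1 <- c(2, 4, 8)
x2 <- c(2, 4, 1)
n1 <- c(9, 9, 9)
n2 <- c(10, 10, 10)
DT <- data.table(a, b, x1, x2, n1, n2)
# nchar() in j, no by
test1 <- DT[, .(y = nchar(b))]
# prop.test() in j, no by
test2 <- DT[, .(pv1 = prop.test(c(x1, x2), c(n1, n2))$p.value)]
# prop.test() in j, grouped by a
test3 <- DT[, .(pv1 = prop.test(c(x1, x2), c(n1, n2))$p.value), by = 'a']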
test1 behaves as I expected: it returns a data table with 3 observations and 1 variable.
test2 confused me. I get only 1 observation back.
test3 is how I got the answer I expected.
I don't understand why test2 did not operate row-wise like test1 did. When do you need to use a by= if you want to process every row in the table?
Thanks for your help,
David
Upvotes: 1
Views: 55
Reputation: 620
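It does process every row in both cases. The difference is in what the function returns. Without a 'by', the expression is evaluated once, with each column passed in as a whole vector. nchar() takes a vector as its argument and returns a vector of the same length, so you get 3 rows back. Functions like prop.test(), sum(), mean() etc. take a vector (or vectors) and return a single result. Thus, without a 'by' argument, the function operates across the whole data table (no sub-groupings) and returns a single value. With by = 'a', the function is run once per group, and since every value of a is unique here, that means once per row.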
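To make the difference concrete, here is a minimal sketch that rebuilds the table from the question and compares a vectorized function with a summarising one. The result names (vec, all_rows, per_group, per_row) are only for illustration, and grouping on seq_len(nrow(DT)) is one common workaround, assumed here for the case where no column is unique per row.

```r
library(data.table)

DT <- data.table(a  = c("A", "B", "C"),
                 b  = c("AA", "BBB", "CCC"),
                 x1 = c(2, 4, 8),  x2 = c(2, 4, 1),
                 n1 = c(9, 9, 9),  n2 = c(10, 10, 10))

# Vectorized function: one output per input element, so 3 rows come back.
vec <- DT[, .(y = nchar(b))]

# Summarising function, no by: evaluated once on the whole columns, so 1 row.
all_rows <- DT[, .(total = sum(x1, x2))]

# Same function with by: evaluated once per group. 'a' is unique per row,
# so this gives one result per original row.
per_group <- DT[, .(total = sum(x1, x2)), by = a]

# If no column is unique per row, grouping on the row index does the same job.
# suppressWarnings() is used because counts this small can trigger
# prop.test()'s chi-squared approximation warning.
per_row <- DT[, .(pv1 = suppressWarnings(
                    prop.test(c(x1, x2), c(n1, n2))$p.value)),
              by = seq_len(nrow(DT))]

nrow(vec)        # 3
nrow(all_rows)   # 1
nrow(per_group)  # 3
nrow(per_row)    # 3
```

Note that test2 in the question is not just "test3 collapsed": without by, prop.test() receives all six counts at once and compares six proportions in a single test, which is a different question from the three two-sample tests that by = 'a' produces.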
Upvotes: 3