Riccardo
Riccardo

Reputation: 763

Create a new column using vertical conditions with data.table

I am trying to use data.table for an operation I don't if data.table is designed for. Suppose I have the following

set.seed(1)
id<-rep(LETTERS[1:6], each=3)
event<-rep(letters[1:4], each=3, length.out=18)
order<-rep(1:3, 6)
var<-rnorm(18)

dt<-data.table(id, event, order, var)
setkey(dt, id)

    id event order         var
 1:  A     a     1 -0.62645381
 2:  A     a     2  0.18364332
 3:  A     a     3 -0.83562861
 4:  B     b     1  1.59528080
 5:  B     b     2  0.32950777
 6:  B     b     3 -0.82046838
 7:  C     c     1  0.48742905
 8:  C     c     2  0.73832471
 9:  C     c     3  0.57578135
10:  D     d     1 -0.30538839
11:  D     d     2  1.51178117
12:  D     d     3  0.38984324
13:  E     a     1 -0.62124058
14:  E     a     2 -2.21469989
15:  E     a     3  1.12493092
16:  F     b     1 -0.04493361
17:  F     b     2 -0.01619026
18:  F     b     3  0.94383621

and I need to calculate in a new column the Euclidean distance between the values in var having different id but same event type, maintaining the given order of observations.

I understood that with the grouping by command in the data.table statement I divide the data into chunks which I can use in connection with other chunks from other data.tables but can I do the same thing within the same data.table?

To be clear, this is what I'd like to obtain

    id event order         var   euclid
 1:  A     a     1 -0.62645381 3.097720
 2:  A     a     2  0.18364332 3.097720
 3:  A     a     3 -0.83562861 3.097720
 4:  B     b     1  1.59528080 2.433635
 5:  B     b     2  0.32950777 2.433635
 6:  B     b     3 -0.82046838 2.433635
 7:  C     c     1  0.48742905       NA
 8:  C     c     2  0.73832471       NA
 9:  C     c     3  0.57578135       NA
10:  D     d     1 -0.30538839       NA
11:  D     d     2  1.51178117       NA
12:  D     d     3  0.38984324       NA
13:  E     a     1 -0.62124058 3.097720
14:  E     a     2 -2.21469989 3.097720
15:  E     a     3  1.12493092 3.097720
16:  F     b     1 -0.04493361 2.433635
17:  F     b     2 -0.01619026 2.433635
18:  F     b     3  0.94383621 2.433635

Many thanks!

Upvotes: 3

Views: 181

Answers (1)

thelatemail
thelatemail

Reputation: 93813

Not sure about efficiency, but this should work at a basic level:

dt[, euclid:= dist(xtabs(var ~ id + order, data=.SD)), by=event]

#    id event order         var   euclid
# 1:  A     a     1 -0.62645381 3.097720 # snip
# 4:  B     b     1  1.59528080 2.433635 # snip
# 7:  C     c     1  0.48742905       NA # snip
#10:  D     d     1 -0.30538839       NA # snip
#13:  E     a     1 -0.62124058 3.097720 # snip
#16:  F     b     1 -0.04493361 2.433635 # snip

To account for @Arun's issue where there are 3+ groups, you can sum the dist outputs, though you will get 0 instead of NA's where there is only one group:

dt[, euclid := sum(dist(xtabs(var ~ id + order, data=.SD))), by=event]

Upvotes: 1

Related Questions