Reputation: 2435
When I have to add missing observations I usually use CJ
in data.table
after setting the keys.
In the following example, two columns are the combinations of each other, and therefore take the same values. I want to add the missing combinations. What is happening when I use "only" l
instead of l[1:3]
?
Data:
library(data.table)
l <- letters[1:3]
DT <- CJ(from = l, to = l)
DT <- DT[, .(year = 1994:1995), .(from, to)]
DT[, g := runif(.N)]
# drop some rows
DT <- DT[-c(2, 6)]
# set the keys
setkey(DT, from, to, year)
head(DT)
# from to year g
# 1: a a 1994 0.6436420
# 2: a b 1994 0.6974629
# 3: a b 1995 0.5686354
# 4: a c 1994 0.6436957
# 5: b a 1994 0.6219831
# 6: b a 1995 0.7191914
# unexpected result:
DT[CJ(l,l,1994:1995)]
# from to year g l
# 1: a a 1994 0.6436420 a
# 2: a a 1995 0.6436420 a
# 3: a a 1994 0.6436420 b
# 4: a a 1995 0.6436420 b
# 5: a a 1994 0.6436420 c
# 6: a a 1995 NA c
# 7: b b 1994 0.7617087 a
# 8: b b 1995 0.7617087 a
# 9: b b 1994 0.7617087 b
# 10: b b 1995 0.7617087 b
# 11: b b 1994 0.7617087 c
# 12: b b 1995 0.8479245 c
# 13: c c 1994 0.4891899 a
# 14: c c 1995 0.4891899 a
# 15: c c 1994 0.4891899 b
# 16: c c 1995 0.4891899 b
# 17: c c 1994 0.4891899 c
# 18: c c 1995 0.7520622 c
# what I expected:
DT[CJ(l[1:3],l[1:3],1994:1995)]
# from to year g
# 1: a a 1994 0.64364200
# 2: a a 1995 NA
# 3: a b 1994 0.69746294
# 4: a b 1995 0.56863539
# 5: a c 1994 0.64369566
# 6: a c 1995 NA
# 7: b a 1994 0.62198311
# 8: b a 1995 0.71919139
# 9: b b 1994 0.76170866
# 10: b b 1995 0.84792449
# 11: b c 1994 0.15793127
# 12: b c 1995 0.26623733
# 13: c a 1994 0.89921463
# 14: c a 1995 0.55417635
# 15: c b 1994 0.38938166
# 16: c b 1995 0.03778206
# 17: c c 1994 0.48918988
# 18: c c 1995 0.75206221
Upvotes: 1
Views: 328
Reputation: 34703
I believe the issue is that CJ(l, l, 1994:1995)
has duplicate names. This is hinted at by verbose=TRUE
:
DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu)
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
This is in a gray area between being a bug or not... better behavior might be to error instead of proceed with potentially wrong results.
Anyway, you can get around this by naming the CJ
arguments:
DT[CJ(from = l, to = l, year = 1994:1995)]
# from to year g
# 1: a a 1994 0.64364200
# 2: a a 1995 NA
# 3: a b 1994 0.69746294
# 4: a b 1995 0.56863539
# 5: a c 1994 0.64369566
# 6: a c 1995 NA
# 7: b a 1994 0.62198311
# 8: b a 1995 0.71919139
# 9: b b 1994 0.76170866
# 10: b b 1995 0.84792449
# 11: b c 1994 0.15793127
# 12: b c 1995 0.26623733
# 13: c a 1994 0.89921463
# 14: c a 1995 0.55417635
# 15: c b 1994 0.38938166
# 16: c b 1995 0.03778206
# 17: c c 1994 0.48918988
# 18: c c 1995 0.75206221
Note that we could also accomplish this without keys:
setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
# it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]
Upvotes: 2