desval
desval

Reputation: 2435

adding missing observations in data.table

When I have to add missing observations I usually use CJ in data.table after setting the keys.

In the following example, two columns are the combinations of each other, and therefore take the same values. I want to add the missing combinations. What is happening when I use "only" l instead of l[1:3]?

Data:

library(data.table)

l <- letters[1:3]

DT <- CJ(from = l, to = l)
DT <- DT[, .(year = 1994:1995), .(from, to)]
DT[, g := runif(.N)]

# drop some rows
DT <- DT[-c(2, 6)]

# set the keys
setkey(DT, from, to, year)


head(DT)
#    from to year         g
# 1:    a  a 1994 0.6436420
# 2:    a  b 1994 0.6974629
# 3:    a  b 1995 0.5686354
# 4:    a  c 1994 0.6436957
# 5:    b  a 1994 0.6219831
# 6:    b  a 1995 0.7191914

# unexpected result:
DT[CJ(l,l,1994:1995)]
#     from to year         g l
#  1:    a  a 1994 0.6436420 a
#  2:    a  a 1995 0.6436420 a
#  3:    a  a 1994 0.6436420 b
#  4:    a  a 1995 0.6436420 b
#  5:    a  a 1994 0.6436420 c
#  6:    a  a 1995        NA c
#  7:    b  b 1994 0.7617087 a
#  8:    b  b 1995 0.7617087 a
#  9:    b  b 1994 0.7617087 b
# 10:    b  b 1995 0.7617087 b
# 11:    b  b 1994 0.7617087 c
# 12:    b  b 1995 0.8479245 c
# 13:    c  c 1994 0.4891899 a
# 14:    c  c 1995 0.4891899 a
# 15:    c  c 1994 0.4891899 b
# 16:    c  c 1995 0.4891899 b
# 17:    c  c 1994 0.4891899 c
# 18:    c  c 1995 0.7520622 c

# what I expected:
DT[CJ(l[1:3],l[1:3],1994:1995)]

#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221

Upvotes: 1

Views: 328

Answers (1)

MichaelChirico
MichaelChirico

Reputation: 34703

I believe the issue is that CJ(l, l, 1994:1995) has duplicate names. This is hinted at by verbose=TRUE:

DT[CJ(l,l,1994:1995), verbose=TRUE]
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'character' length 3
# forder.c received a vector type 'integer' length 2
# i.l has same type (character) as x.from. No coercion needed.
# i.l has same type (character) as x.to. No coercion needed.
# i.V3 has same type (integer) as x.year. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu) 
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

This is in a gray area between being a bug or not... better behavior might be to error instead of proceed with potentially wrong results.

Anyway, you can get around this by naming the CJ arguments:

DT[CJ(from = l, to = l, year = 1994:1995)]
#     from to year          g
#  1:    a  a 1994 0.64364200
#  2:    a  a 1995         NA
#  3:    a  b 1994 0.69746294
#  4:    a  b 1995 0.56863539
#  5:    a  c 1994 0.64369566
#  6:    a  c 1995         NA
#  7:    b  a 1994 0.62198311
#  8:    b  a 1995 0.71919139
#  9:    b  b 1994 0.76170866
# 10:    b  b 1995 0.84792449
# 11:    b  c 1994 0.15793127
# 12:    b  c 1995 0.26623733
# 13:    c  a 1994 0.89921463
# 14:    c  a 1995 0.55417635
# 15:    c  b 1994 0.38938166
# 16:    c  b 1995 0.03778206
# 17:    c  c 1994 0.48918988
# 18:    c  c 1995 0.75206221

Note that we could also accomplish this without keys:

setkey(DT, NULL)
# for those more familiar with SQL syntax, this is a NATURAL JOIN;
#   it's equivalent to `on = c("from", "to", "year")`
DT[CJ(from = l, to = l, year = 1994:1995), on = .NATURAL]

Upvotes: 2

Related Questions