vonjd

Strange error when expanding data.table

We stumbled upon some strange behaviour while trying to expand a data.table. The following code works fine:

library(data.table)

dt <- data.table(var1 = 1:2e3, var2 = 1:2e3, freq = 1:2e3)
system.time(dt.expanded <- dt[, list(freq = rep(1, freq)), by = c("var1", "var2")])
##    user  system elapsed 
##    0.05    0.01    0.06

But using the following data.table

set.seed(1)
dt <- data.table(var1 = sample(letters, 1000, replace = TRUE),
                 var2 = sample(LETTERS, 1000, replace = TRUE),
                 freq = sample(1:10, 1000, replace = TRUE))

with the same code gives

Error in rep(1, freq) : invalid 'times' argument

My question
Might this be a bug in data.table?

(I got the syntax of this example from R Machine Learning Essentials)

Edit
So the problem really seems to be with rep and not with data.table. The help page for rep says for the parameter times:

A integer vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1.

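In the second data.table, some (var1, var2) groups contain more than one row, so within those groups freq (the times argument) has length greater than 1 while x = 1 has length 1. That matches neither case in the help text, and rep throws the error.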
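
To make the two cases from the help page concrete, here is a minimal base R sketch. It only uses rep and tryCatch; the variable names x and res are mine, chosen for illustration.

```r
# Base R only; no packages needed.
x <- 1

# Case 1: times has length 1, so the whole vector is repeated.
rep(x, times = 3)

# Case 2: times has length length(x), so each element gets its own count.
rep(c(1, 1), times = c(2, 3))

# Neither case: x has length 1 but times has length 2.
# tryCatch captures the error message instead of stopping the script.
res <- tryCatch(rep(x, times = c(2, 3)),
                error = function(e) conditionMessage(e))
res
```

The last call should reproduce the same "invalid 'times' argument" message as above, without any data.table involved.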

Answers (1)

Frank

My guess: when rep(x, times) is given a times vector of length greater than 1, it insists that x have the same length (instead of doing the natural thing in R and recycling x). So manual recycling works:

dt[ ,.(rep(rep(1,.N),freq)), by=.(var1,var2)]

This seems to be a quirk of base R (or maybe it's deliberate?), not of data.table. The OP didn't hit the problem in the first example because every (var1, var2) combination there was unique, so grouping by var1 and var2 gave exactly one row per group and the times argument was a scalar.
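
As a rough check of this explanation, the sketch below rebuilds the OP's second table, confirms that some groups have more than one row, and applies the manual recycling. It assumes the data.table package is installed. The names group_sizes and the freq label on the output column are my additions, not part of the original code.

```r
library(data.table)

set.seed(1)
dt <- data.table(var1 = sample(letters, 1000, replace = TRUE),
                 var2 = sample(LETTERS, 1000, replace = TRUE),
                 freq = sample(1:10, 1000, replace = TRUE))

# Count rows per (var1, var2) group. With 1000 rows and only
# 26 * 26 = 676 possible pairs, some groups must have N > 1.
group_sizes <- dt[, .N, by = .(var1, var2)]
any(group_sizes$N > 1)

# Manual recycling: rep(1, .N) gives x one element per row in the group,
# so length(x) matches length(freq) and rep() accepts it.
dt.expanded <- dt[, .(freq = rep(rep(1, .N), freq)), by = .(var1, var2)]

# Each original row contributes freq rows, so the totals should agree.
nrow(dt.expanded) == sum(dt$freq)
```

Both checks should come back TRUE if the reasoning holds: the first shows why times is not a scalar in every group, the second that the expansion produced the expected number of rows.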
