Reputation: 4380
We stumbled upon some strange behaviour trying to expand a data.table. The following code works alright:
library(data.table)
dt <- data.table(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
## user system elapsed
## 0.05 0.01 0.06
But using the following data.table
set.seed(1)
dt <- data.table(var1=sample(letters,1000,replace=TRUE),var2=sample(LETTERS,1000,replace=TRUE),freq=sample(1:10,1000,replace=TRUE))
with the same code gives
Error in rep(1, freq) : invalid 'times' argument
My question
Might this be a bug in data.table?
(I got the syntax of this example from R Machine Learning Essentials)
Edit
So the problem really seems to be with rep and not with data.table. The help page for rep says for the parameter times:
A integer vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1.
In the second data.table, some groups contain more than one row, so times ends up longer than x (which has length 1), and rep throws the error.
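The behaviour can be reproduced without data.table. The following is a minimal base R sketch of the three cases; the variable names x and res are only illustrative.

```r
# Base R only: how rep() treats its `times` argument.
x <- 1

# times of length 1: the whole vector is repeated.
rep(x, 3)
# [1] 1 1 1

# times of length length(x): each element is repeated that many times.
rep(c(1, 2), c(2, 3))
# [1] 1 1 2 2 2

# times of any other length is rejected instead of being recycled.
res <- tryCatch(rep(x, c(2, 3)), error = function(e) conditionMessage(e))
res
# [1] "invalid 'times' argument"
```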
Upvotes: 3
Views: 367
Reputation: 66819
My guess: when rep(x,times) is given a vector for times, it insists that x be the same length (instead of doing the natural thing in R and recycling). So manual recycling works:
dt[ ,.(rep(rep(1,.N),freq)), by=.(var1,var2)]
Seems to be a problem in base R (or maybe it's deliberate?), not in data.table. The OP didn't hit this problem in the first example because every (var1, var2) combination there is unique, so each group contained only one row and the times argument was a scalar.
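To check this on the OP's second data set, here is a rough sketch that counts the rows per group and then applies the manual recycling. The names group.sizes and dt.expanded are just for illustration, and the output column is named freq explicitly to match the original code.

```r
library(data.table)

set.seed(1)
dt <- data.table(var1 = sample(letters, 1000, replace = TRUE),
                 var2 = sample(LETTERS, 1000, replace = TRUE),
                 freq = sample(1:10, 1000, replace = TRUE))

# Rows per (var1, var2) group. With 1000 rows and at most 26 * 26 = 676
# distinct groups, at least one group must have N > 1, so rep(1, freq)
# receives a `times` vector longer than x there.
group.sizes <- dt[, .N, by = .(var1, var2)]
any(group.sizes$N > 1)

# Manual recycling: make x as long as the group before repeating.
dt.expanded <- dt[, .(freq = rep(rep(1, .N), freq)), by = .(var1, var2)]

# Each input row should contribute `freq` output rows.
nrow(dt.expanded) == sum(dt$freq)
```

Since every repeated value is 1, rep(1, sum(freq)) inside the same call should give the same result per group; the nested form just makes the recycling explicit.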
Upvotes: 6