Reputation: 126
I am a big fan and heavy user of data.table in R. I use it in a lot of my code, but I have recently encountered a strange bug:
I have a huge data.table with multiple columns, example:
x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
If I select
dataDT[x=='1']
I end up getting
x y
1: 1 a
whereas
dataDT[(x=='1')]
gives me
x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are factors, and the data.table is keyed by x via setkey().
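For context, here is a minimal sketch (not a reproduction of the bug, which I could not reduce) of the two subset forms on a small keyed table; as I understand it, the bare x=='1' can use the key / automatic index, while the parenthesised form is evaluated as an ordinary logical vector scan:
library(data.table)
dt <- data.table(x = factor(rep(1:3, each = 3)), y = letters[1:9])
setkey(dt, x)      # keyed by x, like my real table
dt[x == '1']       # keyed / auto-indexed subset
dt[(x == '1')]     # plain logical vector scan
Both forms should return the three x == 1 rows here; in my real table only the second one does.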
ADDITIONAL INFO AND CODE:
I actually fixed this issue, but in a way that is neither clear nor intuitive.
My code is structured as follows: a function called from my main code has to add a column to the data.table.
I had previously used the following notation to do so:
dataT[,nC:=oC,]
I have found that creating the new column with
dataT$nC <- dataT$oC
instead fixes the bug completely.
I tried to replicate the exact same bug in a simpler example but could not, possibly because of dependencies on the size and structure of my data.table as well as the specific functions I am running on it.
With that said, I have a working example showing that when you insert a column using the dataT[,nC:=oC,] notation, the table behaves as if it were passed to the function by reference rather than by value.
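To illustrate what I mean by the two notations behaving differently inside a function, here is a small sketch of my own (addByRef/addByCopy are made-up helper names, nC/oC follow my notation above): := adds the column to the caller's table by reference, whereas $<- appears to work on a copy inside the function, so the caller's table is left untouched.
library(data.table)
addByRef  <- function(dt) dt[, nC := oC]            # := modifies dt in place
addByCopy <- function(dt) { dt$nC <- dt$oC; dt }    # $<- copies; caller unaffected
d1 <- data.table(oC = 1:3)
d2 <- data.table(oC = 1:3)
addByRef(d1)
"nC" %in% names(d1)    # TRUE: the caller's table gained the column
addByCopy(d2)
"nC" %in% names(d2)    # FALSE: only the local copy had it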
Also, interestingly, while
dataDT[x=='1']
vs
dataDT[(x=='1')]
now show the same result, the latter is about 10 times slower, which I had noticed before. I hope this code can shed some light.
rm(list=ls())
library(data.table)
superParF <- function(dtInput){
  # aggregate y by x separately for the a==1 and a==2 halves
  dtInputP <- dtInput[a==1]
  dtInputN <- dtInput[a==2]
  outDT <- rbind(dtInputP[,sum(y),by='x'],
                 dtInputN[,sum(y),by='x'])
  return(outDT)
}
superFunction <- function(dtInput){
  # create new column (:= modifies the table by reference, so the caller's table changes too)
  dtInput[,z:=y,]
  # run function on each level of x
  outDT <- rbindlist(lapply(unique(dtInput$x),
                            function(i) superParF(dtInput[x==i])))
  # output outDT
  return(outDT)
}
inputDT <- data.table(x = c(rep(1,100000),
                            rep(2,100000),
                            rep(3,100000),
                            rep(4,100000),
                            rep(5,100000)),
                      y = c(rep(1:100000,5)))
inputDT$x <- as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)
inputDT <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))
setkey(inputDT,x)
# first observation -> the two subsets do not run with the same performance
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
out <- superFunction(inputDT)
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
inputDT
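After running the script, a quick check (my own addition) confirms the by-reference behaviour described above: the z column created inside superFunction() is now present on the global inputDT as well.
"z" %in% names(inputDT)   # TRUE: := added z to the caller's table, not to a copy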
Upvotes: 1
Views: 198
Reputation: 126
Using the dT[,Column:=Value] notation seems to cause the SAME BUG in another post as well!
data.table not recognising logical in filter
Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and that post's bug.
@Matt Dowle: the post I am linking has much more succinct code than mine, and the bug is the same! You might find it of great help in your quest to fix this issue!
Upvotes: 1
Reputation: 59612
I asked in comments to provide the version number and to follow the guidelines on the Support page, which contains:
Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.
So, searching the README.md for the string "index", just using Ctrl-F in the browser, yields:
21. Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.
26. Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.
Those are fixed in v1.9.7, easily installed with the one command detailed on the Installation page.
The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.
We ask you to state the version number up front to save time, because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4, which had this problem:
DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
So which version are you running please and have you tried v1.9.7 since it looks like it's worth a try?
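In the meantime, checking the installed version and, if needed, switching off the auto indexing optimisation (the option named in the NEWS item quoted above) should show whether auto indexing is involved; a sketch:
packageVersion("data.table")            # confirm which version is actually loaded
options(datatable.auto.index = FALSE)   # turn off optimisation of == and %in% in i
# then re-run dataDT[x=='1'] ; if the two subset forms now agree, auto indexing was the culprit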
Upvotes: 3