nbafrank

Reputation: 126

Strange issue with data.table row search

I am a big fan and heavy user of data.table in R. I use it in a lot of my code, but I recently ran into a strange bug:

I have a huge data.table with multiple columns, for example:

   x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c

If I select

dataDT[x=='1']

I end up getting

   x y
1: 1 a

whereas

dataDT[(x=='1')]

gives me

   x y
1: 1 a
2: 1 b
3: 1 c

Any ideas? x and y are factors, and the data.table is keyed on x with setkey.
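For reference, the example table above can be rebuilt as in the sketch below. The construction is an assumption based only on the printed output (both columns as factors, key on x). On a data.table version without the bug, both forms would be expected to return the same three rows.

```r
library(data.table)

# Assumed reconstruction of the example table: x and y are both factors
dataDT <- data.table(x = factor(rep(1:3, each = 3)),
                     y = factor(rep(c("a", "b", "c"), times = 3)))
setkey(dataDT, x)              # key the table on x, as described

resPlain <- dataDT[x == '1']   # form that may be optimized to use the key
resParen <- dataDT[(x == '1')] # parenthesized form, evaluated as a logical vector

print(resPlain)
print(resParen)
```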

ADDITIONAL INFO AND CODE:

I actually fixed this issue, but in a way that is neither clear nor intuitive.

My code is structured as follows: I have a function called from my main code where I have to introduce a column in the data.table.

I have previously used the following notation

dataT[,nC:=oC,]

to do the deed.

I have found that creating the new column by using

dataT$nC <- dataT$oC

instead fixes the bug completely.

I tried to replicate the exact same bug with simpler example code but could not, possibly because it depends on the size and structure of my data.table as well as on the specific functions I run on my table.

With that said, I have a working example that shows that when you insert a column using the dataT[,nC:=oC,] notation, it acts as if the table were passed by reference to the function rather than by value.
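The by-reference behavior can be seen in isolation with a small sketch like the one below. `:=` is documented to modify a data.table in place, and `copy()` is the documented way to get an independent copy inside a function. The helper names `addColByRef` and `addColOnCopy` are hypothetical, chosen only for this illustration.

```r
library(data.table)

# Hypothetical helper: := adds the column in place, so the caller's table changes too
addColByRef <- function(dt) {
  dt[, z := y]
  invisible(dt)
}

# Hypothetical helper: work on a deep copy so the caller's table is left untouched
addColOnCopy <- function(dt) {
  dt <- copy(dt)
  dt[, z := y]
  dt
}

DT <- data.table(x = factor(c(1, 1, 2)), y = c(10, 20, 30))

res <- addColOnCopy(DT)
print("z" %in% names(DT))   # the original has no z yet; only the copy gained it
print("z" %in% names(res))

addColByRef(DT)
print("z" %in% names(DT))   # now the original itself has been modified
```

If the column should stay local to the function, calling `copy()` first is one way to get that without giving up `:=`.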

Also, interestingly enough, while performing

dataDT[x=='1']

vs

dataDT[(x=='1')]

shows the same result, the latter is 10 times slower, which I have noticed previously. I hope this code can shed some light.

rm(list=ls())
library(data.table)


superParF <- function(dtInput){

  dtInputP <- dtInput[a==1]
  dtInputN <- dtInput[a==2]

  outDT    <- rbind(dtInputP[,sum(y),by='x'],
                    dtInputN[,sum(y),by='x'])
  return(outDT)
}

superFunction <- function(dtInput){

  #create new column
  dtInput[,z:=y,]

  #run function
  outDT <- rbindlist(lapply(unique(dtInput$x),
                        function(i)
                          superParF(dtInput[x==i])))
  #output outDT
  return(outDT)
}




inputDT <- data.table(x = c(rep(1,100000),
                        rep(2,100000),
                        rep(3,100000),
                        rep(4,100000),
                        rep(5,100000)),
                  y= c(rep(1:100000,5)))

inputDT$x <-  as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)

inputDT   <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))

setkey(inputDT,x)

#first observation-> the two searches do not work with the same performance

a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])

print(a)
print(b)

out <- superFunction(inputDT)

a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])

print(a)
print(b)

inputDT

Upvotes: 1

Views: 198

Answers (2)

nbafrank

Reputation: 126

Using the dT[,Column:=Value] notation seems to cause the SAME BUG in another post as well!

data.table not recognising logical in filter

Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and that post's bug.

@Matt Dowle: the post I am linking has much more succinct code than mine, and the bug is the same! It should be of great help in your quest to fix this issue!

Upvotes: 1

Matt Dowle

Reputation: 59612

I asked in the comments for the version number and for you to follow the guidelines on the Support page. It contains:

Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.

So, searching the README.md for the string "index" just using Ctrl-F in the browser, yields :

21 Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.

26 Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.

Those are fixed in v1.9.7, which is easily installed with one command detailed on the Installation page.

The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.

We ask you to state the version number up front to save time, because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4, which had this problem:

DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
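As a quick way to check whether automatic indexing is involved, the option quoted above can be switched off around the subset. This is only a sketch on a small made-up table (the `DT` below is an assumption, not the asker's data).

```r
library(data.table)

# Small illustrative table; not the data from the question
DT <- data.table(x = factor(rep(1:3, each = 3)),
                 y = rep(c("a", "b", "c"), times = 3))

# Switch automatic indexing off, remembering the previous setting
old <- options(datatable.auto.index = FALSE)

r1 <- DT[x == "1"]     # length-1 value compared against the factor column
r2 <- DT[x %in% "1"]   # the same rows requested via %in%

print(r1)
print(r2)

options(old)           # restore the previous setting
```

If the results differ between `datatable.auto.index=TRUE` and `FALSE` on your data, that would point at the auto-indexing fixes listed above.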

So which version are you running, please, and have you tried v1.9.7? It looks like it is worth a try.
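The installed version can be reported with base R's `packageVersion()`, for example as in this sketch. The 1.9.6 threshold is taken from the text above; the message wording is just illustrative.

```r
library(data.table)

# Report the installed data.table version so it can be quoted in the question
v <- packageVersion("data.table")
print(v)

# Flag versions older than the CRAN release mentioned above
if (v < "1.9.6") {
  message("data.table ", as.character(v), " is older than v1.9.6; please upgrade first.")
}
```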

Upvotes: 3
