Reputation: 7063
I have to subset a sequence of data.frame
s frequently (millions of times each run). The data.frame
s are of approximate size 200 rows x 30 columns. Depending on the state, the values in the data.frame
change from one iteration to the next. Thus, doing one subset in the beginning is not working.
In contrast to the question, when a data.table starts to be faster than a data.frame, I am looking for a speed-up of subsetting for a given size of the data.frame
/data.table
The following minimum reproducible example shows, that data.frame
seems to be the fastest:
library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x<-runif(nmax,min=0,max=10)
y<-runif(nmax,min=0,max=10)
DF<-data.frame(x,y)
DT<-data.table(x,y)
summary(microbenchmark::microbenchmark(
setkey(DT,x,y),
times = 10L, unit = "us"))
# expr min lq mean median uq max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877 10
summary(microbenchmark::microbenchmark(
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DF[DF$x > 5, ] 41.815 45.426 52.40197 49.9885 57.4010 82.110 100
# 2 `[.data.frame`(DT, DT$x < 5, ) 43.716 47.707 58.06979 53.5995 61.2020 147.873 100
# 3 DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164 100
Is there anything I can do to improve performance?
Edit after input:
data.frame
or data.table
). Most likely, I could implement a different approach, but then I have to re-write the code which was developed over more than 3 years. At the moment, this is not an option. But if there is no way to get it faster this might become an option in the future.data.frames
but just one data.frame
, which changes with each iteration. However, this has no impact on "how to get the subset faster" and I hope that the question is now more comprehensive.Upvotes: 1
Views: 177
Reputation: 6277
You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame
is numerical (or can be converted without too much trouble).
Here we go. First I modified the data to have it with size 200x30:
library(data.table)
nmax = 200
cmax = 30
set.seed(1)
x<-runif(nmax,min=0,max=10)
DF = data.frame(x)
for (i in 2:cmax) {
DF = cbind(DF, runif(nmax,min=0,max=10))
colnames(DF)[ncol(DF)] = paste0('x',i)
}
DT = data.table(DF)
DM = as.matrix(DF) # # # or data.matrix(DF) if you have factors
And the comparison, ranked from quickest to slowest:
summary(microbenchmark::microbenchmark(
DM[DM[, 'x']>5, ], # # # # Quickest
as.matrix(DF)[DF$x>5, ], # # # # Still quicker with conversion
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100
If your use-case involves querying multiple times the data, then you can do the conversion only once and increase the speed by one order of magnitude.
Upvotes: 1