Reputation: 8031
I am using the Hmisc
Package to calculate the quantiles of two continous variables and compare the results in a crosstable. You find my code below.
My problem is that the calculation of the quantiles takes a considerable amount of time if the number of observations increases.
Is there any possibility to speed up this procedure by using the data.table
, ddply
or any other package?
Thanks.
library(Hmisc)
# Set seed
set.seed(123)
# Generate some data
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
c <- data.frame(a,b)
# Calculate quantiles
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)
# Output some descriptives
summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
# Time spent for calculation:
# User System verstrichen
# 25.13 3.47 28.73
Upvotes: 4
Views: 1799
Reputation: 8031
As stated by jlhoward and Ricardo Saporta data.table
doesn't seem to speed up things too much in this case. The cut2
function is clearly the bottleneck here. I used another function to calculate the quantiles (see Is there a better way to create quantile "dummies" / factors in R?) and was able to decrease the calculation time by half:
qcut <- function(x, n) {
if(n<=2)
{
stop("The sample must be split in at least 3 parts.")
}
else{
break.values <- quantile(x, seq(0, 1, length = n + 1), type = 7)
break.labels <- c(
paste0(">=",break.values[1], " & <=", break.values[2]),
sapply(break.values[3:(n)], function(x){paste0(">",break.values[which(break.values == x)-1], " & <=", x)}),
paste0(">",break.values[(n)], " & <=", break.values[(n+1)]))
cut(x, break.values, labels = break.labels,include.lowest = TRUE)
}
}
c$a.quantile.2 <- qcut(c$a, 5)
c$b.quantile.2 <- qcut(c$b, 5)
summaryM(a.quantile.2 ~ b.quantile.2, data=c, overall=TRUE)
# Time spent for calculation:
# User System verstrichen
# 10.22 1.47 11.70
Using data.table
would reduce the calculation time by another second, but I like the summary by the Hmisc
package better.
Upvotes: 3
Reputation: 59365
Data tables don't seem to improve things here:
library(Hmisc)
set.seed(123)
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
library(data.table)
# original approach
system.time({
c <- data.frame(a,b)
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)
smry.1 <-summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
})
user system elapsed
72.79 6.22 79.02
# original data.table approach
system.time({
DT <- data.table(a,b)
DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
smry.2 <- DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
})
user system elapsed
66.86 5.11 71.98
# different data.table approach (simpler, and uses table(...))
system.time({
dt <- data.table(a,b)
smry.3 <- table(dt[,lapply(dt,cut2,g=5)])
})
user system elapsed
67.24 5.02 72.26
Upvotes: 1
Reputation: 55360
You can use data.table
's .N
built in variable, to quickly tabulate.
library(data.table)
library(Hmisc)
DT <- data.table(a,b)
DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
You can break that last line down into two steps, to see what is going on. The second "[ "
simply reshapes the data in a clean format.
DT.tabulated <- DT[, .N, keyby=list(b.quantile, a.quantile)]
DT.tabulated
DT.tabulated[, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
Upvotes: 2