Reputation: 185
I have a data set with ~200k rows and I want to calculate percentile scores for multiple variables. The method I am using takes ~10 minutes for a single variable. Is there a more efficient way to do this? Below is a fake data set and my code.
library(dplyr)
library(purrr)
id <- c(1:200000)
X <- rnorm(200000,mean = 5,sd=100)
DATA <- data.frame(ID =id,Var = X)
percentileCalc <- function(value){
  per_rank <- ((sum(DATA$Var < value)+(0.5*sum(DATA$Var == value)))/length(DATA$Var))
  return(per_rank)
}
First Method:
res <- numeric(length = length(DATA$Var))
sta <- Sys.time()
for (i in seq_along(DATA$Var)) {
  res[i]<-percentileCalc(DATA$Var[i])
}
sto <- Sys.time()
sto - sta
Output:
Time difference of 10.51337 mins
Second Method:
sta <- Sys.time()
res <- map(DATA$Var,percentileCalc)
sto <- Sys.time()
sto - sta
Output:
Time difference of 6.86872 mins
Third Method:
sta <- Sys.time()
res <- sapply(DATA$Var,percentileCalc)
sto <- Sys.time()
sto - sta
Output:
Time difference of 11.1495 mins
Next, I tried a simple element-wise operation, but it was still slow:
simpleOperation <- function(value){
  per_rank <- sum(DATA$Var < value)
  return(per_rank)
}
res <- numeric(length = length(DATA$Var))
sta <- Sys.time()
for (i in seq_along(DATA$Var)) {
  res[i]<-simpleOperation(DATA$Var[i])
}
sto <- Sys.time()
sto - sta
Time difference of 3.369287 mins
sta <- Sys.time()
res <- map(DATA$Var,simpleOperation)
sto <- Sys.time()
sto - sta
Time difference of 3.979965 mins
sta <- Sys.time()
res <- sapply(DATA$Var,simpleOperation)
sto <- Sys.time()
sto - sta
Time difference of 6.535737 mins
dplyr provides percent_rank(), which does roughly the same thing, but my concern is that even a simple operation is slow when iterated over each element of a variable. Maybe I am doing something wrong.
Following is my session info:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] purrr_0.2.2 dplyr_0.5.0
loaded via a namespace (and not attached):
[1] compiler_3.4.0 lazyeval_0.2.0 magrittr_1.5 R6_2.2.0 assertthat_0.1 DBI_0.5-1 tools_3.4.0
[8] tibble_1.2 Rcpp_0.12.10
Upvotes: 4
Views: 228
Reputation: 11728
It seems to me that you are implementing (rank(DATA$Var) - 0.5) / length(DATA$Var).
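Why this works: with the default ties.method = "average", rank() gives each element (number of values below it) + (number of values equal to it + 1) / 2, so subtracting 0.5 and dividing by the length reproduces the formula in the question. The loop versions are slow because every call scans the whole vector, so 200k calls mean roughly 200k x 200k comparisons, while rank() sorts once. A minimal sketch, assuming no missing values; the helper name percentile_rank is made up for illustration:

```r
# Hypothetical helper (not from the question): mid-rank percentile score.
percentile_rank <- function(x) {
  # rank() defaults to ties.method = "average", i.e. for each element:
  # (count of values < x[i]) + (count of values == x[i] + 1) / 2
  (rank(x) - 0.5) / length(x)
}

set.seed(1)
x <- rnorm(200000, mean = 5, sd = 100)

# One sort-based pass instead of 200000 full scans of the vector.
system.time(res <- percentile_rank(x))

# Spot-check a few elements against the definition used in the question.
idx <- c(1, 1000, 200000)
brute <- sapply(x[idx], function(v) (sum(x < v) + 0.5 * sum(x == v)) / length(x))
all.equal(res[idx], brute)
```

On data of this size the vectorized call should finish in a fraction of a second, but the exact timing depends on the machine.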
Verification with your data and with data that contains ties:
N <- 1e4
DATA <- data.frame(
  ID = 1:N,
  Var = rnorm(N, mean = 5, sd = 100),
  Var2 = sample(0:10, size = N, replace = TRUE)
)
percentileCalc <- function(value) {
  (sum(DATA$Var < value) + 0.5 * sum(DATA$Var == value)) / length(DATA$Var)
}
percentileCalc2 <- function(value) {
  (sum(DATA$Var2 < value) + 0.5 * sum(DATA$Var2 == value)) / length(DATA$Var2)
}
all.equal((rank(DATA$Var) - 0.5) / length(DATA$Var),
          sapply(DATA$Var, percentileCalc))
all.equal((rank(DATA$Var2) - 0.5) / length(DATA$Var2),
          sapply(DATA$Var2, percentileCalc2))
simpleOperation <- function(value) {
  sum(DATA$Var < value)
}
simpleOperation2 <- function(value) {
  sum(DATA$Var2 < value)
}
all.equal(rank(DATA$Var, ties.method = "min") - 1,
          sapply(DATA$Var, simpleOperation))
all.equal(rank(DATA$Var2, ties.method = "min") - 1,
          sapply(DATA$Var2, simpleOperation2))
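Since the question asks about several variables, the same expression can be applied column by column. The following is only a sketch under assumptions: the helper percentile_rank, the vars vector, and the "_pct" column suffix are illustrative names, and the data frame is a small stand-in for the real data.

```r
# Hypothetical helper: same rank-based formula as above.
percentile_rank <- function(x) (rank(x) - 0.5) / length(x)

set.seed(42)
N <- 1e4
DATA <- data.frame(
  ID   = 1:N,
  Var  = rnorm(N, mean = 5, sd = 100),
  Var2 = sample(0:10, size = N, replace = TRUE)
)

# Columns to score; adjust to the variables in your own data.
vars <- c("Var", "Var2")

# lapply() returns one numeric vector per column; assigning that list to
# several new column names adds all score columns in one step.
DATA[paste0(vars, "_pct")] <- lapply(DATA[vars], percentile_rank)

head(DATA)
```

Each column costs one call to rank(), so adding more variables scales linearly in the number of columns instead of repeating the element-wise loop for each one.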
Upvotes: 1