Reputation: 13
I have a matrix of size 10000 x 100 and a vector of length 100. I'd like to apply a custom function, percentile, which takes in a vector argument and a scalar argument, to each column of the matrix such that on iteration j, the arguments used with percentile are column j of the matrix and entry j of the vector. Is there a way to use one of the apply functions to do this?
Here's my code. It runs, but doesn't return the correct result.
percentile <- function(x, v){
  length(x[x <= v]) / length(x)
}
X <- matrix(runif(10000 * 100), nrow = 10000, ncol = 100)
y <- runif(100)
result <- apply(X, 2, percentile, v = y)
The workaround that I've been using has been to just append y to X, and re-write the percentile function, as shown below.
X2 <- rbind(X, y)  # keep X itself unchanged
percentile2 <- function(x){
  v <- x[length(x)]
  x <- x[-length(x)]
  length(x[x <= v]) / length(x)
}
result <- apply(X2, 2, percentile2)
This code does return the correct result, but I would prefer something a bit more elegant.
Upvotes: 1
Views: 387
Reputation: 59970
If you understand that R is vectorised and know the right functions, you can avoid loops entirely and do the whole thing in one relatively simple line:
colSums( t( t( X ) <= y ) ) / nrow( X )
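Through vectorisation, R recycles the elements of y down the columns of whatever it is compared with. Comparing X <= y directly would recycle y down each column of X, which is not what we want, so we use the transpose function t to turn the columns into rows, apply the logical comparison <= (so that y[j] lines up with row j of t(X), i.e. column j of X), and then transpose back again.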
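The recycling step is easier to see on a tiny case. The sketch below uses a made-up 3 x 2 matrix `m` and thresholds `v` (names and values are purely illustrative, not taken from the question) to show that comparing against the transposed matrix pairs each threshold with one column of the original, while comparing without the transpose does not.

```r
# Illustrative values only: a 3 x 2 matrix and one threshold per column.
m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 3)   # columns are (1, 2, 3) and (4, 5, 6)
v <- c(2, 5)

# t(m) is 2 x 3. R stores matrices column by column, so v is recycled
# down each column of t(m): v[1] always meets row 1 (column 1 of m)
# and v[2] always meets row 2 (column 2 of m).
cmp <- t(m) <= v
print(cmp)

# Without the transpose, v is recycled down the columns of m itself,
# so thresholds alternate within a column and the pairing is wrong.
wrong <- m <= v
print(wrong)

# Proportion of each column of m at or below its own threshold:
# column 1 has 2 of 3 values <= 2, column 2 has 2 of 3 values <= 5.
prop <- rowSums(cmp) / nrow(m)
print(prop)
```

Here `colSums(wrong) / nrow(m)` gives a different answer for the second column, which is the kind of silent mismatch the transpose avoids.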
Since TRUE and FALSE evaluate to 1 and 0 respectively, we can use colSums to count the rows in each column that meet the condition, and then divide each count by the total number of rows (remember the recycling rule!). The result is exactly the same:
X2 <- rbind(X, y)  # the question's workaround: y appended as the last row
res1 <- apply(X2, 2, percentile2)
res2 <- colSums( t( t( X ) <= y ) ) / nrow( X )
identical( res1 , res2 )
[1] TRUE
Because this doesn't use any R-level loops, it's a lot quicker (about 10 times on this small matrix).
Even better would be to use rowMeans, like this (thanks to @flodel):
rowMeans( t(X) <= y )
Upvotes: 2
Reputation: 8105
I think the easiest and clearest way is to use a for loop:
result2 <- numeric(ncol(X))
for (i in seq_len(ncol(X))) {
  result2[i] <- sum(X[,i] <= y[i])
}
result2 <- result2 / nrow(X)
The fastest and shortest solution I can think of is:
result1 <- rowSums(t(X) <= y) / nrow(X)
SimonO101 explains in his answer how this works. As I said, it is fast. The disadvantage is that it is less clear what exactly is being calculated, although you could solve this by placing this piece of code in a well-named function.
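As a minimal sketch of the "well-named function" idea, the recycling trick could be wrapped as follows. The name `col_percentile` and the input checks are assumptions made for illustration; they do not come from either answer.

```r
# Hypothetical helper: share of each column of X that is at or below
# the matching entry of y. Name and checks are illustrative only.
col_percentile <- function(X, y) {
  stopifnot(is.matrix(X), length(y) == ncol(X))
  # t(X) has one row per column of X, so y is recycled down its columns
  # and y[j] is compared with every value in column j of X.
  rowMeans(t(X) <= y)
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
X <- matrix(runif(10000 * 100), nrow = 10000, ncol = 100)
y <- runif(100)
result <- col_percentile(X, y)
```

The length check is there because a `y` of the wrong length would otherwise be recycled silently (or with only a warning) and give misleading proportions.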
flodel also suggested a solution using mapply, which is an apply that can work on multiple vectors. However, for that to work you first need to put each of the columns of your matrix in a list or data.frame:
result3 <- mapply(percentile, as.data.frame(X), y)
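If converting the matrix to a data.frame feels heavy, another apply-style option is to iterate over the column indices rather than over the columns themselves. This is a sketch of an alternative, not something proposed in the answers; `result4` is just an illustrative name. It also shows why `apply(X, 2, percentile, v = y)` misbehaves: `apply` hands over one column at a time but passes the whole of `y` as `v` on every call, whereas indexing pairs column j with `y[j]` explicitly.

```r
percentile <- function(x, v) {
  length(x[x <= v]) / length(x)
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
X <- matrix(runif(10000 * 100), nrow = 10000, ncol = 100)
y <- runif(100)

# Loop over column indices so that column j is paired with y[j].
# vapply also checks that every call returns a single number.
result4 <- vapply(seq_len(ncol(X)),
                  function(j) percentile(X[, j], y[j]),
                  numeric(1))
```

This is still an R-level loop under the hood, so it should not be expected to match the speed of the recycling one-liner.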
Speed-wise (see below for some benchmarking), the for loop doesn't do badly, and it is faster than using apply (in this case at least). The trick with rowSums and vector recycling is faster still, over 10 times as fast as the solution using apply.
> X <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
> y <- runif(100)
>
> system.time({result1 <- rowSums(t(X) <= y) / nrow(X)})
user system elapsed
0.020 0.000 0.018
>
> system.time({
+ X2 <- rbind(X, y)
+ percentile2 <- function(x){
+ v <- x[length(x)]
+ x <- x[-length(x)]
+ length(x[x <= v]) / length(x)
+ }
+ result <- apply(X2, 2, percentile2)
+ })
user system elapsed
0.252 0.000 0.249
>
>
> system.time({
+ result2 <- numeric(ncol(X))
+ for (i in seq_len(ncol(X))) {
+ result2[i] <- sum(X[,i] <= y[i])
+ }
+ result2 <- result2 / nrow(X)
+ })
user system elapsed
0.024 0.000 0.024
>
> system.time({
+ result3 <- mapply(percentile, as.data.frame(X), y)
+ })
user system elapsed
0.076 0.000 0.073
>
> all(result2 == result1)
[1] TRUE
> all(result2 == result)
[1] TRUE
> all(result3 == result)
[1] TRUE
Upvotes: 2