Reputation: 4614
I want to know how to vectorize and memoize a custom function in R. It seems my way of thinking is not aligned with R's way of operation. So, I gladly welcome any links to good reading material. For example, R inferno is a nice resource, but it didn't help to figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache
packages?
I haven't been able to find any other discussions on this subject. Searching for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching for those keywords at http://r-project.markmail.org/ does not return helpful discussions. I emailed the mailing list and did not receive a complete answer.
I am not solely interested in memoizing the GC function, and I am aware of Bioconductor and the various packages available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank ""
.
I have a function for calculating GC content:
> GC <- function(s) {
if (!is.character(s)) return(NA)
n <- nchar(s)
if (n == 0) return(NA)
m <- gregexpr('[GCSgcs]', s)[[1]]
if (m[1] < 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise
or
R.cache
R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are several orders of magnitude slower.
I tried the hash
package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C
should have a
value of 100
, not NULL
.
Note that using has.key(s, cache)
instead of exists(s, cache)
results
in the same output. Also, using cache[s] <<- result
instead of
cache[[s]] <<- result
results in the same output.
> cache <- hash()
> GCc <- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result <- GC(s)
cache[[s]] <<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
Relevant stackoverflow posts:
Upvotes: 4
Views: 725
Reputation: 176668
This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
len/n
}
Upvotes: 1
Reputation: 4469
While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition. Eg using Joshua's GC2 (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
One can easily define a wrapper like:
GC3 <- function(s) {
x <- factor(s)
GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
Upvotes: 2