Reputation: 20445
I have to do extensive data manipulation on a big data set (using data.table and RStudio, mostly). I would like to monitor the run time of each of my steps without explicitly calling system.time() on each one.
Is there a package or an easy way to show run time by default on each step?
Thank you.
Upvotes: 5
Views: 279
Reputation: 23241
I have to give full credit to @jbecker of Freenode's #R IRC channel for this extra answer, but for me the solution is here: http://adv-r.had.co.nz/Profiling.html
Here's just a little taste of it:
"To understand performance, you use a profiler. There are a number of different types of profilers. R uses a fairly simple type called a sampling or statistical profiler. A sampling profiler stops the execution of code every few milliseconds and records which function is currently executing (along with which function called that function, and so on). For example, consider f(), below:"
library(lineprof)

f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}
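Continuing that chapter's example (the file name below is just an illustration): save the functions to a file and source() it so lineprof has line numbers to attach timings to, then profile a call to f(). Note that lineprof is a GitHub package rather than a CRAN one.

# devtools::install_github("hadley/lineprof")  # lineprof lives on GitHub
source("profiling-example.R")  # illustrative file name containing f(), g(), h() above
l <- lineprof(f())
l          # prints time and memory used per line
# shine(l) # optionally explore the results interactively in the browser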
Upvotes: 0
Reputation: 103948
It's not exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments containing how long each top-level statement took to run. i.e. time_file() turns this:
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
into this:
{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}
# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#: user system elapsed
#: 0.451 0.003 0.453
# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#: user system elapsed
#: 0.029 0.000 0.029
# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#: user system elapsed
#: 0.008 0.000 0.008
It doesn't time code inside a top-level { block, so you can choose not to time stuff you're not interested in.
I don't think there's any way to automatically add timing as a top-level effect without somehow modifying the way that you run the code - i.e. using something like time_file instead of source.
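If you just want the timings printed to the console rather than written back into the file, a minimal sketch of the same idea is: parse the file and evaluate each top-level expression under system.time(). The function name time_exprs below is made up for illustration; it is not the gist's API.

# Minimal sketch (illustrative only, not the actual time_file() gist):
# time each top-level expression of a file and print the result.
time_exprs <- function(path, envir = globalenv()) {
  exprs <- parse(path)                     # one element per top-level statement
  for (e in exprs) {
    timing <- system.time(eval(e, envir = envir))
    cat(deparse(e)[1], "\n")
    cat(sprintf("#:  user %.3f  system %.3f  elapsed %.3f\n",
                timing[["user.self"]], timing[["sys.self"]], timing[["elapsed"]]))
  }
}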
You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)
library(microbenchmark)

microbenchmark(
  runif(1e4),
  system.time(runif(1e4)),
  system.time(runif(1e4), gcFirst = FALSE)
)
So timing itself adds relatively little overhead (about 20 µs per call on my computer), but the garbage collection that system.time() runs by default (gcFirst = TRUE) adds about 27 ms per call. So unless you have thousands of top-level calls, you're unlikely to see much impact.
Upvotes: 5