AdamNYC

Reputation: 20445

Return system.time by default

I have to do extensive data manipulation on a big data set (using data.table and RStudio, mostly). I would like to monitor the run time of each of my steps without explicitly calling system.time() on each one.

Is there a package or an easy way to show run time by default on each step?
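For example, at the moment every step looks roughly like this (DT and the column names are just placeholders for my real data):

library(data.table)
DT <- data.table(id = rep(1:3, each = 2), x = rnorm(6))  # dummy data

# wrapping every single step by hand gets tedious quickly
system.time(res1 <- DT[, .(mean_x = mean(x)), by = id])
system.time(res2 <- merge(DT, res1, by = "id"))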

Thank you.

Upvotes: 5

Views: 279

Answers (2)

Hack-R

Reputation: 23241

I have to give full credit to @jbecker of Freenode's #R IRC channel for this extra answer, but for me the solution is here: http://adv-r.had.co.nz/Profiling.html

Here's just a little taste of it:

"To understand performance, you use a profiler. There are a number of different types of profilers. R uses a fairly simple type called a sampling or statistical profiler. A sampling profiler stops the execution of code every few milliseconds and records which function is currently executing (along with which function called that function, and so on). For example, consider f(), below:"

library(lineprof)
f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}
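To actually collect and look at the profile for that example, the chapter goes on to run the profiler on f() along these lines (a sketch; in the book the functions are saved in a file and source()d first so lineprof can record line references):

# run the sampling profiler on f() and print the per-line timings
# (expect roughly 0.1 s attributed to each pause() call)
l <- lineprof(f())
l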

Upvotes: 0

hadley

Reputation: 103948

It's not exactly what you're asking for, but I've written time_file (https://gist.github.com/4183595), which source()s an R file, runs the code, and then rewrites the file, inserting comments that record how long each top-level statement took to run.

i.e. time_file() turns this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")

into this:

{
  load_all("~/documents/plyr/plyr")
  load_all("~/documents/plyr/dplyr")
  library(data.table)
  data("baseball", package = "plyr")
  vars <- list(n = quote(length(id)), m = quote(n + 1))
}

# Baseline case: use ddply
a <- ddply(baseball, "id", summarise, n = length(id))
#:    user  system elapsed
#:   0.451   0.003   0.453

# New summary method: ~20x faster
b <- summarise_by(baseball, group("id"), vars)
#:    user  system elapsed
#:   0.029   0.000   0.029

# But still not as fast as specialised count, which is basically id + tabulate
# so maybe able to eke out a little more with a C loop ?
count(baseball, "id")
#:    user  system elapsed
#:   0.008   0.000   0.008

It doesn't time code inside a top-level { block, so you can choose not to time stuff you're not interested in.

I don't think there's any way to automatically add timing as a top-level effect without somehow modifying the way that you run the code - i.e. using something like time_file instead of source.
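For example, instead of source("analysis.r") you'd run something like this (the file name is just a placeholder, and the exact call may differ slightly from the gist):

# after loading time_file() from the gist above:
time_file("analysis.r")  # rewrites analysis.r with #: timing comments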

You might wonder what effect timing every top-level operation has on the overall speed of your code. Well, that's easy to answer with a microbenchmark ;)

library(microbenchmark)
microbenchmark(
  runif(1e4), 
  system.time(runif(1e4)),
  system.time(runif(1e4), gc = FALSE)
)

Timing itself adds relatively little overhead (20 µs on my computer), but the default gc adds about 27 ms per call. So unless you have thousands of top-level calls, you're unlikely to see much impact.

Upvotes: 5
