maxheld

Reputation: 4273

How to elegantly + robustly cache external script in knitr rmd document?

Say, I have an external R script external.R:

df.rand <- data.frame(x = rnorm(n = 100), y = rnorm(n = 100))

Then there's a main.Rmd:

\documentclass{article}

\begin{document}

<<setup, include = FALSE>>=
library(knitr)
library(ggplot2)
# global chunk options
opts_chunk$set(cache=TRUE, autodep=TRUE, concordance=TRUE, progress=TRUE, cache.extra = tools::md5sum("external.R"))
@

<<source, include=FALSE>>=
source("external.R")
@


<<plot>>=
ggplot(data = df.rand, mapping = aes(x = x, y = y)) + geom_point()
@

\end{document}

It's helpful to have this in an external script, because in reality, it's a bunch of import, data cleaning and simulation tasks that would pollute the main.Rmd.

Every chunk in main.Rmd depends on the external script, so any change to that script should invalidate the cache. To account for this dependency I added the above cache.extra = tools::md5sum("external.R").
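To illustrate why the hash works as a cache key, here is a minimal sketch that uses a hypothetical temporary file in place of external.R. It only shows how tools::md5sum() behaves: the hash changes whenever the file content changes, and a path that does not exist yields NA rather than an error, so the file name has to match exactly on case-sensitive file systems.

```r
# Hypothetical temporary file standing in for external.R
script <- tempfile(fileext = ".R")
writeLines("df.rand <- data.frame(x = rnorm(100), y = rnorm(100))", script)
hash_before <- unname(tools::md5sum(script))

# Any edit to the file changes the hash, and with it the value of cache.extra
writeLines("df.rand <- data.frame(x = rnorm(200), y = rnorm(200))", script)
hash_after <- unname(tools::md5sum(script))

# A path that does not exist gives NA instead of raising an error
hash_missing <- unname(tools::md5sum(file.path(tempdir(), "no-such-file.R")))
```

If the value passed to cache.extra is NA on every run, it never changes, so edits to the script would go unnoticed by the cache.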

That seems to work ok.

I'm looking for best practices.

There are no side effects (except for some library() calls, but I can move them to main.Rmd).

I'm always worried that I'm somehow doing it wrong.

Upvotes: 4

Views: 281

Answers (1)

CL.

Reputation: 14957

There should be better approaches than the do-it-yourself caching you currently use. To start with, you could split external.R into chunks:

# ---- CreateRandomDFs----
df.rand1 <- data.frame(x = rnorm(n = 100), y = rnorm(n = 100))
df.rand2 <- data.frame(x = rnorm(n = 100), y = rnorm(n = 100))

# ---- CreateOtherObjects----

# stuff

In main.Rmd, call read_chunk(path = 'external.R') in an uncached chunk (it has to run on every knit, because it only registers the code with knitr). Then execute the chunks:

<<CreateRandomDFs>>=
@
<<CreateOtherObjects>>=
@
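A minimal sketch of the loading chunk that has to come before these two, assuming a hypothetical chunk label read-external. Setting cache = FALSE on the chunk overrides the global cache = TRUE, so the external code is re-read on every run:

```rnw
<<read-external, cache=FALSE, include=FALSE>>=
# "read-external" is a made-up label; any unique label works.
# read_chunk() only registers the labelled code from external.R with knitr;
# nothing is executed until an empty chunk with a matching label is reached.
read_chunk(path = "external.R")
@
```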

If autodep doesn't work, add dependson to your chunks. A chunk that only uses df.rand1 and df.rand2 gets dependson = "CreateRandomDFs"; when other objects are also used, set dependson = c("CreateRandomDFs", "CreateOtherObjects").
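As a sketch, with hypothetical chunk labels summarise and combine, the two cases could look like this:

```rnw
<<summarise, dependson="CreateRandomDFs">>=
# Uses only objects created in CreateRandomDFs
summary(df.rand1)
summary(df.rand2)
@

<<combine, dependson=c("CreateRandomDFs", "CreateOtherObjects")>>=
# Placeholder body: code that also uses objects from CreateOtherObjects
nrow(df.rand1)
@
```

Whenever the cache of a chunk listed in dependson is rebuilt, the cache of the dependent chunk is rebuilt as well.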

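You may also invalidate a chunk's cache whenever a certain object changes by passing that object in a custom chunk option, e.g. cache.whatever = quote(df.rand1). The option name is arbitrary as long as it does not clash with a built-in option; the cache.extra you already use works the same way.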

This way, a change in external.R no longer invalidates the whole cache. How you split the code in that file into chunks is crucial: with too many chunks, you will have to list many dependencies; with too few, the cache gets invalidated more often than necessary.

Upvotes: 3
