Benjamin Christoffersen

Reputation: 4841

knitr cache: update if data file changes but not other options (e.g., `fig.xyz`)

Suppose I use knitr and have a chunk that takes a while to run. I want this chunk to be re-run if a data file changes, but not if I change an option such as fig.path. The latter suggests setting the cache chunk option to 1, but then I cannot use a checksum as suggested here.

Here is an example of a markdown file:

---
title: "Example"
author: "Benjamin Christoffersen"
date: "September 2, 2018"
output: html_document
---

```{r setup, include=FALSE}
data_file <- "~/data.RDS"
knitr::opts_chunk$set(echo = TRUE, cache.extra = tools::md5sum(data_file))
```

```{r load_data}
dat <- readRDS(data_file)
```

```{r large_computation, cache = 1}
Sys.sleep(10)
Sys.time() # just to show that the result does not change
```

```{r make_some_plot}
hist(dat)
```

Running set.seed(1); saveRDS(rnorm(100), "~/data.RDS") and knitting yields

[screenshot of the knitted output]

Then running set.seed(2); saveRDS(rnorm(100), "~/data.RDS") and knitting yields

[screenshot of the knitted output]

showing that large_computation is not updated. This is expected, since cache.extra is not in the knitr:::cache1.opts vector. Of course, I could save the md5sum result, compare it with the previously stored value, and use cache.rebuild or something similar in the large_computation chunk, but it would be nice to have a knitr solution. I often change chunk options (e.g., dpi, fig.width, and fig.height), so using cache = TRUE will not work. I guess one could modify the package so that options can be added to knitr:::cache1.opts.
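The manual approach mentioned above (store the checksum, compare it with the previous one, feed the result to cache.rebuild) could be sketched roughly as follows. The helper name file_changed and the ".md5" stamp file are assumptions made for illustration; they are not part of knitr.

```r
# Hypothetical helper (not part of knitr): returns TRUE when `path` has
# changed since the last call, judged by its MD5 checksum. The checksum is
# stored in a stamp file next to the data file (an assumed convention).
file_changed <- function(path, stamp = paste0(path, ".md5")) {
  # Checksum of the file as it is now
  current <- unname(tools::md5sum(path))
  # Checksum recorded by the previous call, if there was one
  previous <- if (file.exists(stamp)) readLines(stamp, n = 1L) else NA_character_
  # Record the current checksum for the next call
  writeLines(current, stamp)
  # No previous record counts as "changed"
  is.na(previous) || !identical(current, previous)
}
```

One way to use it, under the same assumptions: call it once in an uncached setup chunk, for example data_changed <- file_changed(data_file), and pass cache.rebuild = data_changed to the slow chunk. Calling it only once matters because each call overwrites the stamp file. If knitting fails after the stamp is written, the change would go unnoticed on the next run, so this is only a starting point.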

Upvotes: 5

Views: 1387

Answers (2)

fenrir

Reputation: 1

I found another workaround for cache.extra being ignored when cache = 1 or 2. Add the following hook to the setup chunk. It prepends a comment containing the cache.extra value to the chunk's code, so the cache is invalidated whenever cache.extra changes.

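knitr::opts_hooks$set(cache.extra = function(options) {
  # Prepend the cache.extra value as a comment, so that the chunk's code
  # (and with it the cache) changes whenever cache.extra changes
  options$code <- c(sprintf("# cache.extra: %s", options$cache.extra), options$code)
  options
})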

Upvotes: 0

CL.

Reputation: 14957

If I understand the question correctly, the problem is that cache.extra is not taken into account if cache is set to 1. In fact, this is by design.

The desired behavior is to invalidate the cache of all chunks (including chunks with cache = 1) if an external file (or, more generally, some value provided to cache.extra) changes.

As mentioned in the question, one way to achieve this is the chunk option cache.rebuild. Instead of manually keeping track of changes in the external file, though, I'd take advantage of knitr's built-in caching capabilities:

```{r cachecontrol, cache = TRUE, cache.extra = tools::md5sum(data_file)}
knitr::opts_chunk$set(cache.rebuild = TRUE)
```

Adding this as an early chunk, the cache of all subsequent chunks is invalidated if data_file changes. The idea is to cache the chunk that controls caching of subsequent chunks – but only if the external file is unchanged.

Of course, this only works if no global chunk options are changed before the cachecontrol chunk is evaluated.


Full example from the question:

Run set.seed(1); saveRDS(rnorm(100), "data.RDS") with different seeds to generate different external files, then knit:

---
title: "Invalidate all chunks conditional on external file (even if cache=1)"
output: html_document
---

```{r}
data_file <- "data.RDS"
```

```{r cachecontrol, include = FALSE, cache = TRUE, cache.extra = tools::md5sum(data_file)}
# do NOT change global chunk options before this chunk
knitr::opts_chunk$set(cache.rebuild = TRUE)
```

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 8)
```


```{r load_data}
dat <- readRDS(data_file)
```

```{r large_computation, cache = 1}
Sys.sleep(10)
Sys.time() # just to show that the result does not change unless the external file changes
```

```{r make_some_plot}
hist(dat)
```

Upvotes: 2
