dariober
dariober

Reputation: 9062

How to execute R inside snakemake

I find the snakemake documentation quite terse. Since I've been asked this a few times I thought posting the question and my own answer.

How do you execute or integrate this R script in a snakemake rule?

my-script.R

CUTOFF <- 25

dat <- read.table('input-table.txt')

out <- dat[dat$V1 > CUTOFF, ]

write.table(out, 'output-table.txt')

Snakefile:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    # now what?        

Upvotes: 2

Views: 5765

Answers (3)

Fritz
Fritz

Reputation: 31

I've created a simple template for running snakemake with R - here's the Github link:

https://github.com/fritzbayer/snakemake-with-R

It shows two simple options for passing variables between snakemake and R.

Upvotes: 3

merv
merv

Reputation: 76950

@dariober has a thorough answer. I wanted to share the variation on Option 1 that I use to make interactive development/debugging of R scripts a bit easier.

Mock snakemake Object

One can include a preamble to create a mock snakemake object conditioned on whether the script is run interactively or not. This mimics the actual object that gets instantiated, enabling one to step through the code with a realistic input, but gets skipped over when Snakemake is executing the script.

scripts/my-script.R

#!/usr/bin/env Rscript

################################################################################
## Mock `snakemake` preamble
################################################################################

if (interactive()) {
  library(methods)
  Snakemake <- setClass(
    "Snakemake", 
    slots=c(
      input='list', 
      output='list', 
      params='list',
      threads='numeric'
    )
  )
  snakemake <- Snakemake(
    input=list(txt="input-table.txt"),
    output=list(out="output-table.txt"),
    params=list(cutoff=25),
    threads=1
  )
}

################################################################################

CUTOFF <- snakemake@params$cutoff

dat <- read.table(snakemake@input$txt)

out <- dat[dat$V1 > CUTOFF, ]

write.table(out, snakemake@output$out)

Object Details

The full set of slots on the Snakemake class can be seen in the generate_preamble code from the snakemake.script.Rscript class. Since this is subject to change, one can inspect the code in the installed version using (from Python where Snakemake is installed):

from inspect import getsource
from snakemake.script import RScript

print(getsource(RScript.generate_preamble))

Upvotes: 1

dariober
dariober

Reputation: 9062

I'm going to propose three solutions, roughly in order from most canonical.

OPTION 1: Use the script directory in combination with the snakemake R object

Replace the actual filenames inside the R script with the S4 object snakemake which holds input, output, params etc:

my-script.R

CUTOFF <- snakemake@params[['cutoff']]

dat <- read.table(snakemake@input[['txt']])

out <- dat[dat$V1 > CUTOFF, ]

write.table(out, snakemake@output[['out']])

Similarly, wildcards values would be accessible via snakemake@wildcards[['some-wildcard']]

The rule would look like:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    script:
        '/path/to/my-script.R'

Note that the script filename must end in .R or .r to instruct snakemake that this is an R script.

Pros

  • This is the simplest option, just replace the actual filenames and parameters with snakemake@... in the R script

Cons

  • You may have many short scripts that make the pipeline difficult to follow. What does my-script.R actually do?

  • --printshellcmds option is not helpful here

  • Debugging the R script may be cluncky because of the embedded snakemake object


OPTION 2: Write a standalone R script executable via shell directive

For example using the argparse library for R:

my-script.R

#!/usr/bin/env Rscript

library(argparse)

parser <- ArgumentParser(description= 'This progrom does stuff')

parser$add_argument('--input', '-i', help= 'I am the input file')
parser$add_argument('--output', '-o', help= 'I am the output file')
parser$add_argument('--cutoff', '-c', help= 'Some filtering cutoff', type= 'double')

xargs<- parser$parse_args()

CUTOFF <- xargs$cutoff
dat <- read.table(xargs$input)
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, xargs$output)

The snakemake rule is like any other one executing shell commands:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    shell:
        r"""
        /path/to/my-script.R --input {input.txt} --output {output.out} --cutoff {params.cutoff}
        """

Pros

  • You get a standalone, nicely documented script that you can use elsewhere

  • --printshellcmds tells you what is being executed and you can re-run it outside snakemake

Cons

  • Some setting up to do via argparse

  • Not so easy to debug by opening the R interpreter and running the individual R commands


OPTION 3 Create a temporary R script as an heredoc that you run via Rscript:

This is all self-contained in the rule:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    params:
        cutoff= 25,
    shell:
        r"""
cat <<'EOF' > {rule}.$$.tmp.R

CUTOFF <- {params.cutoff}
dat <- read.table('{input.txt}')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, '{output.out}')

EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
        """

Explanantion: The syntax cat <<'EOF' tells Bash that everything until the EOF marker is a string to be written to file {rule}.$$.tmp.R. Before running this shell script, snakemake will replace the placeholders {...} as usual. So file {rule}.$$.tmp.R will be one.$$.tmp.R where bash will replace $$ with the current process ID. The combination of {rule} and $$ reasonably ensures that each temporary R script has a distinct filename. NB: EOF is not a special keyword, you can use any marker string to delimit the heredoc.

The content of {rule}.$$.tmp.R will be:

CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')

after execution via Rscript, {rule}.$$.tmp.R will be deleted by rm provided that Rscripts exited clean.

Pros

  • Especially for short scripts, you see what the rule actually does. No need to look into other script files.

  • --printshellcmds option shows the exact R code. You can copy & paste it to the R interpreter as it is which is very useful for debugging and developing

Cons

  • Clunky, quite a bit of boiler plate code

  • If input/output/params are lists, you need to split them with e.g. strsplit('{params}', sep= ' ') inside the R script. This is not great especially if you have space inside the list items.

  • If Rscript fails the temp R script is not deleted and it litters your working dir.

Upvotes: 10

Related Questions