Reputation: 9062
I find the snakemake documentation quite terse. Since I've been asked this a few times I thought posting the question and my own answer.
How do you execute or integrate this R script in a snakemake rule?
my-script.R
CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')
Snakefile:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
# now what?
Upvotes: 2
Views: 5765
Reputation: 31
I've created a simple template for running snakemake with R - here's the Github link:
https://github.com/fritzbayer/snakemake-with-R
It shows two simple options for passing variables between snakemake and R.
Upvotes: 3
Reputation: 76950
@dariober has a thorough answer. I wanted to share the variation on Option 1 that I use to make interactive development/debugging of R scripts a bit easier.
snakemake
ObjectOne can include a preamble to create a mock snakemake
object conditioned on whether the script is run interactively or not. This mimics the actual object that gets instantiated, enabling one to step through the code with a realistic input, but gets skipped over when Snakemake is executing the script.
scripts/my-script.R
#!/usr/bin/env Rscript
################################################################################
## Mock `snakemake` preamble
################################################################################
if (interactive()) {
library(methods)
Snakemake <- setClass(
"Snakemake",
slots=c(
input='list',
output='list',
params='list',
threads='numeric'
)
)
snakemake <- Snakemake(
input=list(txt="input-table.txt"),
output=list(out="output-table.txt"),
params=list(cutoff=25),
threads=1
)
}
################################################################################
CUTOFF <- snakemake@params$cutoff
dat <- read.table(snakemake@input$txt)
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, snakemake@output$out)
The full set of slots on the Snakemake
class can be seen in the generate_preamble
code from the snakemake.script.Rscript
class. Since this is subject to change, one can inspect the code in the installed version using (from Python where Snakemake is installed):
from inspect import getsource
from snakemake.script import RScript
print(getsource(RScript.generate_preamble))
Upvotes: 1
Reputation: 9062
I'm going to propose three solutions, roughly in order from most canonical.
OPTION 1: Use the script
directory in combination with the snakemake
R object
Replace the actual filenames inside the R script with the S4 object snakemake
which holds input, output, params etc:
my-script.R
CUTOFF <- snakemake@params[['cutoff']]
dat <- read.table(snakemake@input[['txt']])
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, snakemake@output[['out']])
Similarly, wildcards values would be accessible via snakemake@wildcards[['some-wildcard']]
The rule would look like:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
script:
'/path/to/my-script.R'
Note that the script filename must end in .R
or .r
to
instruct snakemake that this is an R script.
Pros
snakemake@...
in the R scriptCons
You may have many short scripts that make the pipeline difficult to follow.
What does my-script.R
actually do?
--printshellcmds
option is not helpful here
Debugging the R script may be cluncky because of the embedded snakemake
object
OPTION 2: Write a standalone R script executable via shell
directive
For example using the argparse library for R:
my-script.R
#!/usr/bin/env Rscript
library(argparse)
parser <- ArgumentParser(description= 'This progrom does stuff')
parser$add_argument('--input', '-i', help= 'I am the input file')
parser$add_argument('--output', '-o', help= 'I am the output file')
parser$add_argument('--cutoff', '-c', help= 'Some filtering cutoff', type= 'double')
xargs<- parser$parse_args()
CUTOFF <- xargs$cutoff
dat <- read.table(xargs$input)
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, xargs$output)
The snakemake rule is like any other one executing shell commands:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
shell:
r"""
/path/to/my-script.R --input {input.txt} --output {output.out} --cutoff {params.cutoff}
"""
Pros
You get a standalone, nicely documented script that you can use elsewhere
--printshellcmds
tells you what is being executed and you can re-run it outside snakemake
Cons
Some setting up to do via argparse
Not so easy to debug by opening the R interpreter and running the individual R commands
OPTION 3 Create a temporary R script as an heredoc that you run via Rscript:
This is all self-contained in the rule:
rule one:
input:
txt= 'input-table.txt',
output:
out= 'output-table.txt',
params:
cutoff= 25,
shell:
r"""
cat <<'EOF' > {rule}.$$.tmp.R
CUTOFF <- {params.cutoff}
dat <- read.table('{input.txt}')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, '{output.out}')
EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
"""
Explanantion: The syntax cat <<'EOF'
tells Bash that everything until the EOF
marker is a string to be written to file {rule}.$$.tmp.R
. Before running this
shell script, snakemake will replace the placeholders {...}
as usual. So file
{rule}.$$.tmp.R
will be one.$$.tmp.R
where bash will replace $$
with the
current process ID. The combination of {rule}
and $$
reasonably ensures
that each temporary R script has a distinct filename. NB: EOF
is not a special
keyword, you can use any marker string to delimit the heredoc.
The content of {rule}.$$.tmp.R
will be:
CUTOFF <- 25
dat <- read.table('input-table.txt')
out <- dat[dat$V1 > CUTOFF, ]
write.table(out, 'output-table.txt')
after execution via Rscript, {rule}.$$.tmp.R
will be deleted by rm
provided
that Rscripts exited clean.
Pros
Especially for short scripts, you see what the rule actually does. No need to look into other script files.
--printshellcmds
option shows the exact R code. You can copy & paste it
to the R interpreter as it is which is very useful for debugging and developing
Cons
Clunky, quite a bit of boiler plate code
If input/output/params are lists, you need to split them with e.g.
strsplit('{params}', sep= ' ')
inside the R script. This is not great
especially if you have space inside the list items.
If Rscript fails the temp R script is not deleted and it litters your working dir.
Upvotes: 10