Snakemake: how to use one integer from list each call as input to script?

Question

I'm trying to practice writing workflows in snakemake.

The contents of my Snakefile:

configfile: "config.yaml"

rule get_col:
  input:
   expand("data/{file}.csv",file=config["datname"])
  output:
   expand("output/{file}_col{param}.csv",file=config["datname"],param=config["cols"])
  params:
   col=config["cols"]
  script:
   "scripts/getCols.R"

The contents of config.yaml:

cols:
  [2,4]
datname:
  "GSE3790_expression_data"

My R script:

getCols=function(input,output,col) {
  dat=read.csv(input)
  dat=dat[,col]
  write.csv(dat,output,row.names=F)
}

getCols(snakemake@input[[1]],snakemake@output[[1]],snakemake@params[['col']])

It seems that both columns are passed to the script at once. What I want is for each output file to be built from a single column taken from the list.

Since the second output never gets a chance to be created (both columns are used to create the first output), Snakemake throws an error:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 3 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Job completed successfully, but some output files are missing.

On a slightly unrelated note, I thought I could write the input as "data/{file}.csv", but that returns:

WildcardError in line 4 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Wildcards in input files cannot be determined from output files:
'file'

Any help would be much appreciated!

jafors · Accepted Answer

It looks like you want to run your R script once per file and per value of col, which means twice for your single file. In that case, the rule has to be executed twice as well. Using expand inside the rule is also too much here, in my opinion. expand fills the wildcards with all possible values and returns a list of the resulting files. The output of your rule is therefore every combination of files and cols, which the simple script cannot create in one run. This is also why file cannot be inferred from the output: it has already been expanded there, so no wildcard is left to match.
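To make the effect of expand concrete, here is a minimal pure-Python sketch. Note that expand_sketch is a hypothetical stand-in written for illustration, not Snakemake's own implementation; it only assumes that expand behaves like a Cartesian product over the supplied values, as described above.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Hypothetical imitation of expand(): fill the pattern with every
    combination of the given wildcard values and return the file list."""
    names = list(wildcards)
    # Wrap single values in a list so a string is not iterated character by character.
    value_lists = [v if isinstance(v, (list, tuple)) else [v]
                   for v in wildcards.values()]
    return [pattern.format(**dict(zip(names, combo)))
            for combo in product(*value_lists)]


# Same values as in config.yaml from the question.
config = {"datname": "GSE3790_expression_data", "cols": [2, 4]}

outputs = expand_sketch("output/{file}_col{param}.csv",
                        file=config["datname"], param=config["cols"])
print(outputs)
# ['output/GSE3790_expression_data_col2.csv', 'output/GSE3790_expression_data_col4.csv']
```

The result is a plain list of two concrete file names with no wildcards left in it. Used as the output of a single rule, it declares that one job has to produce both files.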

Instead, write the rule more simply, for just one file and one column, and use expand on the resulting output in a rule that needs this output as its input. If these files are the final output of your workflow, list them as the input of a rule all to tell Snakemake what the ultimate goal is.

rule all:
  input:
    expand("output/{file}_col{param}.csv",
    file=config["datname"], param=config["cols"])

rule get_col:
  input:
    "data/{file}.csv"
  output:
    "output/{file}_col{param}.csv"
  params:
    col=lambda wc: int(wc.param)  # wildcard values are strings; the R script indexes by position
  script:
    "scripts/getCols.R"

Snakemake will infer from rule all (or from any other rule that uses these outputs as its input) which files are needed and will run rule get_col once for each of them.
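The inference step can be pictured as pattern matching on file names. The sketch below is an illustration only, not Snakemake's code: match_output is a hypothetical helper, and it assumes that every wildcard matches one or more arbitrary characters.

```python
import re


def match_output(pattern, target):
    """Hypothetical helper: match a requested file name against an output
    pattern and return the wildcard values, or None if it does not match."""
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])   # literal text between wildcards
        regex += f"(?P<{m.group(1)}>.+)"             # assumption: a wildcard matches ".+"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# One of the files requested by rule all.
target = "output/GSE3790_expression_data_col4.csv"

wildcards = match_output("output/{file}_col{param}.csv", target)
print(wildcards)   # {'file': 'GSE3790_expression_data', 'param': '4'}

# The matched values are then substituted into the input pattern.
input_file = "data/{file}.csv".format(**wildcards)
print(input_file)  # data/GSE3790_expression_data.csv
```

Two things follow from this picture. The matched values are strings ('4', not 4), so a column index has to be converted before it is used as a number. And if the output contains no wildcards because it was already expanded, there is nothing to match, which is consistent with the WildcardError in the question.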
