Losashik

Reputation: 1

How do I make a rule with a directory as input and multiple directories/files as output?

I want to build a workflow that converts BCL files from a sequencer into expression matrices using the cellranger software. I am new to Snakemake.

I copy the files from storage to the local machine, then run cellranger mkfastq in the shell to generate FASTQ files, which are stored in FASTQ/.

To generate an expression matrix from the FASTQ files, I have to pass the whole FASTQ directory to cellranger. cellranger then creates sample directories where it stores the expression matrices, reports, logs and other files.

My pipeline:

samples = ['201', '202']
fc_name = '230119_FOO'
run = 'storage/vud/230119_FOO'

rule all:
        input:
                expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name = fc_name)
                
#Copy from storage to local machine
rule copy:
        input:
                expand({run}, run=run)
        output:
                expand("BCL/{fc_name}", fc_name = fc_name)
        shell:
                "rsync -ah {run} BCL/"

#Make FASTQ files
rule mkfastq:
        input:
                fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
        output:
                expand("FASTQ/{fc_name}", fc_name = fc_name),
                expand("FASTQ/{fc_name}/outs/input_samplesheet.csv", fc_name = fc_name)
        shell:
                "cellranger mkfastq --run={input.fastq_run} --id={fc_name} --output-dir=FASTQ/"

# Make matrices
rule mkmat:
        input:
                expand("FASTQ/{fc_name}", fc_name = fc_name)
        output:
                expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name=fc_name)
        shell:
                expand("cellranger count -id=RESULT/{{fc_name}}/{{samples}} --transcriptome=refdata-gex-mm10-2020-A/ --fastqs=FASTQ/230119_FOO --sample={{samples}}", samples = samples, fc_name=fc_name)

When I perform a dry run of the pipeline, Snakemake throws an error:

 File "/miniconda3/envs/snakemake/lib/python3.11/site-packages/snakemake/jobs.py", line 521, in shellcmd
    self.format_wildcards(self.rule.shellcmd)
  File "/miniconda3/envs/snakemake/lib/python3.11/site-packages/snakemake/jobs.py", line 986, in format_wildcards
    f"{ex.__class__.__name__}: {ex}, when formatting the following:\n"
TypeError: can only concatenate str (not "list") to str

How can I pass the directory "FASTQ/230119_FOO" to the mkmat rule and get this output?

├── RESULT
│   ├── 230119_FOO
│   │   ├── 201
│   │   │   ├── ...
│   │   ├── 202
│   │   │   ├── ...

Upvotes: 0

Views: 145

Answers (1)

Dmitry Kuzminov

Reputation: 6600

First of all, you have an issue with the shell: section of rule mkmat. This section must be a single string, but the expand function returns a list of strings, and that is exactly what the interpreter complains about:

TypeError: can only concatenate str (not "list") to str
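The same failure can be reproduced in plain Python, without Snakemake. In the minimal sketch below, the list comprehension is only a stand-in for what expand returns (one string per sample), and the prefix string and the try_concat helper are hypothetical names introduced for illustration:

```python
# Stand-in for the result of expand(): one command string per sample, in a list.
samples = ["201", "202"]
commands = ["cellranger count --sample={}".format(s) for s in samples]


def try_concat(prefix, value):
    """Return the TypeError message raised by prefix + value, or None on success."""
    try:
        prefix + value
    except TypeError as ex:
        return str(ex)
    return None


# Concatenating a str with a list fails, as in the traceback above.
error_for_list = try_concat("prefix: ", commands)
# Concatenating a str with a single str works.
error_for_str = try_concat("prefix: ", commands[0])

print(error_for_list)
print(error_for_str)
```

In the Snakefile, this means the shell: section should hold one plain string (with placeholders such as {input} where needed) rather than an expand(...) call.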

In any case, --dry-run ignores the contents of the shell: sections (as long as they are valid strings), so for the sake of clarity we can simply remove them and try --dry-run again.

samples = ['201', '202']
fc_name = '230119_FOO'
run = 'storage/vud/230119_FOO'

rule all:
        input:
                expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name = fc_name)
                
rule copy:
        input:
                expand({run}, run=run)
        output:
                expand("BCL/{fc_name}", fc_name = fc_name)

rule mkfastq:
        input:
                fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
        output:
                expand("FASTQ/{fc_name}", fc_name = fc_name),
                expand("FASTQ/{fc_name}/outs/input_samplesheet.csv", fc_name = fc_name)

rule mkmat:
        input:
                expand("FASTQ/{fc_name}", fc_name = fc_name)
        output:
                expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name=fc_name)

Now the problem is in rule mkfastq, which declares two outputs where one is a child of the other:

ChildIOException:
File/directory is a child to another output:
...\FASTQ\230119_FOO
...\FASTQ\230119_FOO\outs\input_samplesheet.csv

I'm not sure whether this should be considered a bug in Snakemake (see, for example, this discussion: https://github.com/bioinformatics-centre/BayesTyper/issues/29). In any case, there are workarounds, and the easiest one in your simplified case is to declare only the directory as the output:

rule mkfastq:
        input:
                fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
        output:
                expand("FASTQ/{fc_name}", fc_name = fc_name)

Overall, your pipeline is oversimplified: it doesn't contain any wildcards, so you don't need the expand calls at all, because the pipeline is fully defined by the global variables. In real pipelines, dependencies should be defined not by global variables but by the structure of your filesystem. In more complex pipelines, simply removing one of the outputs of rule mkfastq may not help, and you would need to redesign the pipeline.
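To illustrate what wildcards would add: when a rule's output is a pattern such as "RESULT/{fc_name}/{sample}", the wildcard values are inferred from each requested target path instead of being enumerated with expand. The plain-Python sketch below mimics that inference. Note that match_target is a hypothetical helper written for this example, not Snakemake's actual implementation, and it assumes for simplicity that a wildcard value never contains a slash:

```python
import re


def match_target(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    # Splitting on {name} placeholders puts the wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:
            # Wildcard: capture one path component under its own name.
            regex += "(?P<{}>[^/]+)".format(part)
        else:
            # Literal text between wildcards must match exactly.
            regex += re.escape(part)
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


pattern = "RESULT/{fc_name}/{sample}"
for target in ["RESULT/230119_FOO/201", "RESULT/230119_FOO/202"]:
    print(target, "->", match_target(pattern, target))
```

With this approach only rule all needs expand to list the final targets; a rule like mkmat could declare a single wildcard pattern as its output and refer to the matched values (for example {wildcards.sample}) inside its shell string.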

Upvotes: 1
