Reputation: 1
I want to build a workflow that converts BCL files from a sequencer into an expression matrix using the cellranger software. I am new to Snakemake.
I copy the files from storage to the local machine, then run mkfastq in the shell to generate FASTQ files and store them in FASTQ/.
To generate the expression matrix from the FASTQ files, I have to pass the whole FASTQ directory to cellranger. After that, cellranger creates sample directories where it stores the expression matrices, reports, logs and other files.
My pipeline:
samples = ['201', '202']
fc_name = '230119_FOO'
run = 'storage/vud/230119_FOO'

rule all:
    input:
        expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name = fc_name)

#Copy from storage to local machine
rule copy:
    input:
        expand({run}, run=run)
    output:
        expand("BCL/{fc_name}", fc_name = fc_name)
    shell:
        "rsync -ah {run} BCL/"

#Make FASTQ files
rule mkfastq:
    input:
        fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
    output:
        expand("FASTQ/{fc_name}", fc_name = fc_name),
        expand("FASTQ/{fc_name}/outs/input_samplesheet.csv", fc_name = fc_name)
    shell:
        "cellranger mkfastq --run={input.fastq_run} --id={fc_name} --output-dir=FASTQ/"

# Make matrices
rule mkmat:
    input:
        expand("FASTQ/{fc_name}", fc_name = fc_name)
    output:
        expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name=fc_name)
    shell:
        expand("cellranger count -id=RESULT/{{fc_name}}/{{samples}} --transcriptome=refdata-gex-mm10-2020-A/ --fastqs=FASTQ/230119_FOO --sample={{samples}}", samples = samples, fc_name=fc_name)
When I do a dry run of the pipeline, Snakemake throws an error:
File "/miniconda3/envs/snakemake/lib/python3.11/site-packages/snakemake/jobs.py", line 521, in shellcmd
self.format_wildcards(self.rule.shellcmd)
File "/miniconda3/envs/snakemake/lib/python3.11/site-packages/snakemake/jobs.py", line 986, in format_wildcards
f"{ex.__class__.__name__}: {ex}, when formatting the following:\n"
TypeError: can only concatenate str (not "list") to str
How can I pass the directory "FASTQ/230119_FOO" to the mkmat rule and get this output:
├── RESULT
│ ├── 230119_FOO
│ │ ├── 201
│ │ │ ├── ...
│ │ ├── 202
│ │ │ ├── ...
Upvotes: 0
Views: 145
Reputation: 6600
First of all, you have an issue with the shell: section of rule mkmat. This section must be a single string, while the expand function returns a list of strings, and that is exactly what the interpreter complains about:
TypeError: can only concatenate str (not "list") to str
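The same kind of failure can be reproduced without Snakemake. The sketch below is plain Python and only mimics the situation: the list comprehension stands in for what expand produces (a list of strings), and the "prefix: " string is a hypothetical stand-in for any string the shell command gets combined with. It is not Snakemake's actual code.

```python
samples = ["201", "202"]

# Stand-in for expand(): one formatted string per sample, collected in a list.
template = "cellranger count --sample={sample}"
commands = [template.format(sample=s) for s in samples]

# A shell command has to be a single string. Concatenating a str with a list
# raises a TypeError like the one in the traceback.
try:
    "prefix: " + commands
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(commands)
print(error_message)
```

The takeaway is that the shell: section needs one string, not one string per sample.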
Anyway, --dry-run ignores the contents of the shell: sections (as long as they provide valid strings), so for the clarity of your question we may just remove them and try --dry-run again.
samples = ['201', '202']
fc_name = '230119_FOO'
run = 'storage/vud/230119_FOO'

rule all:
    input:
        expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name = fc_name)

rule copy:
    input:
        expand({run}, run=run)
    output:
        expand("BCL/{fc_name}", fc_name = fc_name)

rule mkfastq:
    input:
        fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
    output:
        expand("FASTQ/{fc_name}", fc_name = fc_name),
        expand("FASTQ/{fc_name}/outs/input_samplesheet.csv", fc_name = fc_name)

rule mkmat:
    input:
        expand("FASTQ/{fc_name}", fc_name = fc_name)
    output:
        expand("RESULT/{fc_name}/{sample}", sample=samples, fc_name=fc_name)
Now the problem is in rule mkfastq, as it claims two outputs where one is a child of the other:
ChildIOException:
File/directory is a child to another output:
...\FASTQ\230119_FOO
...\FASTQ\230119_FOO\outs\input_samplesheet.csv
I'm not sure whether this should be considered a bug in Snakemake (see, for example, this discussion: https://github.com/bioinformatics-centre/BayesTyper/issues/29). Either way, there are workarounds, and the easiest one in your simplified case is to declare only the directory as the output:
rule mkfastq:
    input:
        fastq_run=expand("BCL/{fc_name}", fc_name = fc_name)
    output:
        expand("FASTQ/{fc_name}", fc_name = fc_name)
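To see why the two original outputs conflict and why the single-directory version does not, the parent/child relation can be checked in plain Python. This is only an illustration of the idea, not how Snakemake implements the check, and find_child_outputs is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def find_child_outputs(outputs):
    """Return (parent, child) pairs where one declared output lies inside another."""
    paths = [PurePosixPath(o) for o in outputs]
    return [
        (str(parent), str(child))
        for parent in paths
        for child in paths
        if parent != child and parent in child.parents
    ]


# The two outputs originally declared by rule mkfastq.
original_outputs = [
    "FASTQ/230119_FOO",
    "FASTQ/230119_FOO/outs/input_samplesheet.csv",
]
# The workaround: declare only the directory.
fixed_outputs = ["FASTQ/230119_FOO"]

conflicts_before = find_child_outputs(original_outputs)
conflicts_after = find_child_outputs(fixed_outputs)
print(conflicts_before)
print(conflicts_after)
```

With both outputs declared, the samplesheet is reported as a child of the directory; with only the directory declared, no conflict remains.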
Overall, your pipeline is overly simplified: it contains no wildcards, so you don't need any expand calls, because the pipeline is fully defined by the global variables. In real pipelines, dependencies should be defined not by global variables but by the structure of your filesystem. So in more complex pipelines, just removing one of the outputs of rule mkfastq may not help, and you would need to redesign the pipeline.
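As a rough illustration of what "defined by the structure of your filesystem" means, here is a plain-Python model of wildcard matching. It assumes an output pattern RESULT/{fc_name}/{sample}: each requested target is matched against the pattern, and the captured values are used to build one command string per sample. The helper match_wildcards is hypothetical, the matching is simplified (each wildcard covers exactly one path component here), and the cellranger command line is only an example assembled from the flags in the question.

```python
import re


def match_wildcards(pattern, target):
    """Match a target path against a pattern such as 'RESULT/{fc_name}/{sample}'.

    Simplified model: every wildcard matches exactly one path component.
    """
    # re.split with one capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)           # literal text
        else:
            regex += f"(?P<{part}>[^/]+)"      # named wildcard
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


output_pattern = "RESULT/{fc_name}/{sample}"
command_template = (
    "cellranger count --id={sample} "
    "--transcriptome=refdata-gex-mm10-2020-A/ "
    "--fastqs=FASTQ/{fc_name} --sample={sample}"
)

# The targets that rule all would request.
targets = ["RESULT/230119_FOO/201", "RESULT/230119_FOO/202"]

commands = []
for target in targets:
    wildcards = match_wildcards(output_pattern, target)
    commands.append(command_template.format(**wildcards))

for command in commands:
    print(command)
```

In a Snakefile, the corresponding design would be a rule whose output contains the wildcards and whose shell: section is a single string that refers to them (for example via {wildcards.sample}), so that Snakemake runs the rule once per requested sample directory.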
Upvotes: 1