Reputation: 11
My colleague and I have put together a read mapping and variant calling pipeline using Snakemake. We’re using two different sets of short-read whole genome samples. One set of samples consists of newly generated WGS data and the other consists of samples from the NCBI SRA repository. All of them are being mapped to the same reference genome. For clarity, the sample names are as follows:
sra_ids = [
    "SRR1657028", "SRR1657029", "SRR1575526", "SRR1575545", "SRR1575527",
    "SRR1575528", "SRR1575543", "SRR1575544", "SRR1575541", "SRR1575542", "SRR1575539",
    "SRR1575540", "SRR1575532", "SRR1575531", "SRR1575534", "SRR1575533", "SRR1575538",
    "SRR1575537", "SRR1575536", "SRR1575535", "SRR1575530", "SRR1575529"]
new_samples = [
    "RANO330-OMH_Pedw1", "RANO332-OMH_Pedw2", "RANO54-OMH_Pedw3", "TSINJ32-OMH_Pdia1",
    "TSINJ38-OMH_Pdia2", "TSINJ47-OMH_Pdia3", "JEJ01-OMH_Pcan1", "JEJ3-11-OMH_Pcan2",
    "MERY3-OMH_Pcan3", "ANAL10-OMH_Pper1", "TOBI5-1-OMH_Pper2", "TOBI5-3-OMH_Pper3",
    "DAR4-11-OMH_Ptat1", "DAR4-39-OMH_Ptat2", "DAR4-5-OMH_Ptat3", "JAM4-16-OMH_Pcor1",
    "JAM4-20-OMH_Pcor2", "JAM4-7-OMH_Pcor3", "KIBO15-OMH_Pdec1", "KIBO36-OMH_Pdec2",
    "KIBO44-OMH_Pdec3", "KMTEA7-10-OMH_Pver1", "KMTEA7-2-OMH_Pver2", "KMTEA7-4-OMH_Pver3",
    "DASI5-08-OMH_Alang1", "DASI5-16-OMH_Alang2", "DASI5-21-OMH_Alang3"]
In our initial rules, we download and process the SRA data. We now want to use symlinks to pull all of the fastqs (from both sets, sra_ids and new_samples) into a single directory with a uniform file-name format, so that all samples can be processed identically in every downstream rule. The following rule is meant to create those symlinks (note that the sra1 and sra2 input functions point to relative paths, while new1 and new2 point to absolute paths):
rule consolidate_fastqs:
    input:
        sra1 = expand(
            "renamed_fastqs/{sample}_fixed_1.fastq.gz",
            sample=sra_ids),
        sra2 = expand(
            "renamed_fastqs/{sample}_fixed_2.fastq.gz",
            sample=sra_ids),
        new1 = expand(
            os.path.join(fastq_directory, "{sample}_read1.fastq.gz"), sample=new_samples),
        new2 = expand(
            os.path.join(fastq_directory, "{sample}_read2.fastq.gz"), sample=new_samples)
    output:
        expand(
            "fastqs_consolidated/{sample}_{read}.fastq.gz",
            sample=initial_sample_list,
            read=["read1", "read2"])
    params:
        threads = 1,
        mem = 4,
        t = very_short
    run:
        for i in input.sra1:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read1.fastq.gz".format(basename)
            shell(
                "ln -srf {original} {new_name} && touch -h {new_name}")
        for i in input.sra2:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read2.fastq.gz".format(basename)
            shell(
                "ln -srf {original} {new_name} && touch -h {new_name}")
        for i in input.new1:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read1.fastq.gz".format(basename)
            shell(
                "ln -sf {original} {new_name} && touch -h {new_name}")
        for i in input.new2:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read2.fastq.gz".format(basename)
            shell(
                "ln -sf {original} {new_name} && touch -h {new_name}")
We’re running this pipeline on a cluster using CentOS 3.9 and Slurm for job management. The pipeline fails at the above rule however, with the following error:
MissingOutputException in line 241 of /scratch/general/lustre/u6035429/DissAssembly/Snakefile:
Job Missing files after 5 seconds:
fastqs_consolidated/RANO330-OMH_Pedw1_read1.fastq.gz
fastqs_consolidated/RANO330-OMH_Pedw1_read2.fastq.gz
fastqs_consolidated/RANO332-OMH_Pedw2_read1.fastq.gz
fastqs_consolidated/RANO332-OMH_Pedw2_read2.fastq.gz
fastqs_consolidated/RANO54-OMH_Pedw3_read1.fastq.gz
fastqs_consolidated/RANO54-OMH_Pedw3_read2.fastq.gz
fastqs_consolidated/TSINJ32-OMH_Pdia1_read1.fastq.gz
fastqs_consolidated/TSINJ32-OMH_Pdia1_read2.fastq.gz
fastqs_consolidated/TSINJ38-OMH_Pdia2_read1.fastq.gz
fastqs_consolidated/TSINJ38-OMH_Pdia2_read2.fastq.gz
fastqs_consolidated/TSINJ47-OMH_Pdia3_read1.fastq.gz
fastqs_consolidated/TSINJ47-OMH_Pdia3_read2.fastq.gz
fastqs_consolidated/JEJ01-OMH_Pcan1_read1.fastq.gz
fastqs_consolidated/JEJ01-OMH_Pcan1_read2.fastq.gz
fastqs_consolidated/JEJ3-11-OMH_Pcan2_read1.fastq.gz
fastqs_consolidated/JEJ3-11-OMH_Pcan2_read2.fastq.gz
fastqs_consolidated/MERY3-OMH_Pcan3_read1.fastq.gz
fastqs_consolidated/MERY3-OMH_Pcan3_read2.fastq.gz
fastqs_consolidated/ANAL10-OMH_Pper1_read1.fastq.gz
fastqs_consolidated/ANAL10-OMH_Pper1_read2.fastq.gz
fastqs_consolidated/TOBI5-1-OMH_Pper2_read1.fastq.gz
fastqs_consolidated/TOBI5-1-OMH_Pper2_read2.fastq.gz
fastqs_consolidated/TOBI5-3-OMH_Pper3_read1.fastq.gz
fastqs_consolidated/TOBI5-3-OMH_Pper3_read2.fastq.gz
fastqs_consolidated/DAR4-11-OMH_Ptat1_read1.fastq.gz
fastqs_consolidated/DAR4-11-OMH_Ptat1_read2.fastq.gz
fastqs_consolidated/DAR4-39-OMH_Ptat2_read1.fastq.gz
fastqs_consolidated/DAR4-39-OMH_Ptat2_read2.fastq.gz
fastqs_consolidated/DAR4-5-OMH_Ptat3_read1.fastq.gz
fastqs_consolidated/DAR4-5-OMH_Ptat3_read2.fastq.gz
fastqs_consolidated/JAM4-16-OMH_Pcor1_read1.fastq.gz
fastqs_consolidated/JAM4-16-OMH_Pcor1_read2.fastq.gz
fastqs_consolidated/JAM4-20-OMH_Pcor2_read1.fastq.gz
fastqs_consolidated/JAM4-20-OMH_Pcor2_read2.fastq.gz
fastqs_consolidated/JAM4-7-OMH_Pcor3_read1.fastq.gz
fastqs_consolidated/JAM4-7-OMH_Pcor3_read2.fastq.gz
fastqs_consolidated/KIBO15-OMH_Pdec1_read1.fastq.gz
fastqs_consolidated/KIBO15-OMH_Pdec1_read2.fastq.gz
fastqs_consolidated/KIBO36-OMH_Pdec2_read1.fastq.gz
fastqs_consolidated/KIBO36-OMH_Pdec2_read2.fastq.gz
fastqs_consolidated/KIBO44-OMH_Pdec3_read1.fastq.gz
fastqs_consolidated/KIBO44-OMH_Pdec3_read2.fastq.gz
fastqs_consolidated/KMTEA7-10-OMH_Pver1_read1.fastq.gz
fastqs_consolidated/KMTEA7-10-OMH_Pver1_read2.fastq.gz
fastqs_consolidated/KMTEA7-2-OMH_Pver2_read1.fastq.gz
fastqs_consolidated/KMTEA7-2-OMH_Pver2_read2.fastq.gz
fastqs_consolidated/KMTEA7-4-OMH_Pver3_read1.fastq.gz
fastqs_consolidated/KMTEA7-4-OMH_Pver3_read2.fastq.gz
fastqs_consolidated/DASI5-08-OMH_Alang1_read1.fastq.gz
fastqs_consolidated/DASI5-08-OMH_Alang1_read2.fastq.gz
fastqs_consolidated/DASI5-16-OMH_Alang2_read1.fastq.gz
fastqs_consolidated/DASI5-16-OMH_Alang2_read2.fastq.gz
fastqs_consolidated/DASI5-21-OMH_Alang3_read1.fastq.gz
fastqs_consolidated/DASI5-21-OMH_Alang3_read2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job consolidate_fastqs since they might be corrupted:
fastqs_consolidated/SRR1657028_read1.fastq.gz, fastqs_consolidated/SRR1657028_read2.fastq.gz, fastqs_consolidated/SRR1657029_read1.fastq.gz, fastqs_consolidated/SRR1657029_read2.fastq.gz, fastqs_consolidated/SRR1575526_read1.fastq.gz,
And so on for the other SRA samples.
We’ve tried increasing the latency wait, to no avail. Moreover, we’ve monitored the fastqs_consolidated directory in real time, and the rule does produce working symlinks for all of the fastq files. Stranger still, Snakemake fails to recognize the outputs for the new samples, yet only deletes the symlinks for the SRA samples; it does this even if we create the symlinks manually. And despite the supposedly missing output all belonging to the new samples, working symlinks to those fastq files remain in the output directory (fastqs_consolidated).
We’ve tried various iterations of symlink commands, including running with and without the “-r” flag and with and without relative paths from the symlink to the original file (i.e., both ../{original} and {original}).
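For reference, here is the basename extraction from the run block checked in isolation, using one example file name from each set (the absolute path is a hypothetical stand-in for a file under fastq_directory):

```python
# Sanity check of the link-naming logic in the run block, applied to
# one example input path from each sample set.
sra_file = "renamed_fastqs/SRR1657028_fixed_1.fastq.gz"
new_file = "/path/to/fastqs/RANO330-OMH_Pedw1_read1.fastq.gz"  # hypothetical absolute path

print(sra_file.split("/")[-1].split("_")[0])  # SRR1657028
print(new_file.split("/")[-1].split("_")[0])  # RANO330-OMH
```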
Do you have any suggestions for what might be going wrong?
Upvotes: 1
Views: 245
Reputation: 9062
In my opinion, relying on a naming convention is unreliable and difficult to maintain. This is what I would do...
Prepare a sample sheet (a table) that links sample IDs to fastq files. In this table you can also record any per-sample information you need for processing. Read this file with pandas (better) or with your own csv/tsv parser; the information in the sample sheet will then instruct snakemake on how to proceed. Note that the sample names can be unrelated to the fastq file names, and local fastq files can live in any directory. I think this is a clean and flexible setup. For example, this is sample_sheet.tsv:
sample_id   sra_id       fastq_r1                                  fastq_r2
sample1     SRR1657028   sra/SRR1657028.fastq.R1.gz                sra/SRR1657028.fastq.R2.gz
sample2     NA           fastq_dir/RANO330-OMH_Pedw1.R1.fastq.gz   fastq_dir/RANO330-OMH_Pedw1.R2.fastq.gz
The snakefile:
import pandas

ss = pandas.read_csv('sample_sheet.tsv', sep='\t')

rule all:
    input:
        expand('vcf/{sample_id}.vcf', sample_id=ss.sample_id),

rule sra_download:
    output:
        'sra/{sra_id}.fastq.R1.gz',
        'sra/{sra_id}.fastq.R2.gz',
    shell:
        r"""
        fastq-dump --sra-id {wildcards.sra_id} ...
        # or better, query the ENA archive
        """

rule align:
    input:
        # Use the sample sheet to link sample_id to fastq files
        fq1 = lambda wc: ss[ss.sample_id == wc.sample_id].fastq_r1.iloc[0],
        fq2 = lambda wc: ss[ss.sample_id == wc.sample_id].fastq_r2.iloc[0],
        ref = 'genome.fa',
    output:
        bam = 'bam/{sample_id}.bam',
    shell:
        r"""
        bwa mem or whatever {input.ref} {input.fq1} {input.fq2} > {output.bam}
        """

rule call_variants:
    input:
        bam = 'bam/{sample_id}.bam',
    output:
        vcf = 'vcf/{sample_id}.vcf',
    shell:
        r"""
        call variants {input.bam} > {output.vcf}
        """
There is no need to rename files or rely on any naming scheme. If the fastq files already exist on the file system, snakemake will proceed to the alignment; otherwise, it will use rule sra_download to get them. I haven't tested this and there may be errors, but hopefully you get the idea.
Upvotes: 0