Reputation: 11
My colleague and I have put together a read mapping and variant calling pipeline using Snakemake. We’re using two different sets of short-read whole genome samples. One set of samples consists of newly generated WGS data and the other consists of samples from the NCBI SRA repository. All of them are being mapped to the same reference genome. For clarity, the sample names are as follows:
sra_ids = [
    "SRR1657028", "SRR1657029", "SRR1575526", "SRR1575545", "SRR1575527",
    "SRR1575528", "SRR1575543", "SRR1575544", "SRR1575541", "SRR1575542", "SRR1575539",
    "SRR1575540", "SRR1575532", "SRR1575531", "SRR1575534", "SRR1575533", "SRR1575538",
    "SRR1575537", "SRR1575536", "SRR1575535", "SRR1575530", "SRR1575529"]
new_samples = [
    "RANO330-OMH_Pedw1", "RANO332-OMH_Pedw2", "RANO54-OMH_Pedw3", "TSINJ32-OMH_Pdia1",
    "TSINJ38-OMH_Pdia2", "TSINJ47-OMH_Pdia3", "JEJ01-OMH_Pcan1", "JEJ3-11-OMH_Pcan2",
    "MERY3-OMH_Pcan3", "ANAL10-OMH_Pper1", "TOBI5-1-OMH_Pper2", "TOBI5-3-OMH_Pper3",
    "DAR4-11-OMH_Ptat1", "DAR4-39-OMH_Ptat2", "DAR4-5-OMH_Ptat3", "JAM4-16-OMH_Pcor1",
    "JAM4-20-OMH_Pcor2", "JAM4-7-OMH_Pcor3", "KIBO15-OMH_Pdec1", "KIBO36-OMH_Pdec2",
    "KIBO44-OMH_Pdec3", "KMTEA7-10-OMH_Pver1", "KMTEA7-2-OMH_Pver2", "KMTEA7-4-OMH_Pver3",
    "DASI5-08-OMH_Alang1", "DASI5-16-OMH_Alang2", "DASI5-21-OMH_Alang3"]
In our initial rules, we download and process the SRA data. We now want to use symlinks to pull all of the fastqs (from both sets, sra_ids and new_samples) into a single directory with a uniform file-name format, so that all samples can be processed identically in every downstream rule. The following rule is meant to create those symlinks (note that the sra1 and sra2 input functions point to relative paths, while new1 and new2 point to absolute paths):
rule consolidate_fastqs:
    input:
        sra1 = expand(
            "renamed_fastqs/{sample}_fixed_1.fastq.gz",
            sample=sra_ids),
        sra2 = expand(
            "renamed_fastqs/{sample}_fixed_2.fastq.gz",
            sample=sra_ids),
        new1 = expand(
            os.path.join(fastq_directory, "{sample}_read1.fastq.gz"), sample=new_samples),
        new2 = expand(
            os.path.join(fastq_directory, "{sample}_read2.fastq.gz"), sample=new_samples)
    output:
        expand(
            "fastqs_consolidated/{sample}_{read}.fastq.gz",
            sample=initial_sample_list,
            read=["read1", "read2"])
    params:
        threads = 1,
        mem = 4,
        t = very_short
    run:
        for i in input.sra1:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read1.fastq.gz".format(basename)
            shell(
                "ln -srf {original} {new_name} && touch -h {new_name}")
        for i in input.sra2:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read2.fastq.gz".format(basename)
            shell(
                "ln -srf {original} {new_name} && touch -h {new_name}")
        for i in input.new1:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read1.fastq.gz".format(basename)
            shell(
                "ln -sf {original} {new_name} && touch -h {new_name}")
        for i in input.new2:
            original = i
            basename = i.split("/")[-1].split("_")[0]
            new_name = "fastqs_consolidated/{}_read2.fastq.gz".format(basename)
            shell(
                "ln -sf {original} {new_name} && touch -h {new_name}")
We’re running this pipeline on a cluster using CentOS 3.9 and Slurm for job management. The pipeline fails at the above rule however, with the following error:
MissingOutputException in line 241 of /scratch/general/lustre/u6035429/DissAssembly/Snakefile:
Job Missing files after 5 seconds:
fastqs_consolidated/RANO330-OMH_Pedw1_read1.fastq.gz
fastqs_consolidated/RANO330-OMH_Pedw1_read2.fastq.gz
fastqs_consolidated/RANO332-OMH_Pedw2_read1.fastq.gz
fastqs_consolidated/RANO332-OMH_Pedw2_read2.fastq.gz
fastqs_consolidated/RANO54-OMH_Pedw3_read1.fastq.gz
fastqs_consolidated/RANO54-OMH_Pedw3_read2.fastq.gz
fastqs_consolidated/TSINJ32-OMH_Pdia1_read1.fastq.gz
fastqs_consolidated/TSINJ32-OMH_Pdia1_read2.fastq.gz
fastqs_consolidated/TSINJ38-OMH_Pdia2_read1.fastq.gz
fastqs_consolidated/TSINJ38-OMH_Pdia2_read2.fastq.gz
fastqs_consolidated/TSINJ47-OMH_Pdia3_read1.fastq.gz
fastqs_consolidated/TSINJ47-OMH_Pdia3_read2.fastq.gz
fastqs_consolidated/JEJ01-OMH_Pcan1_read1.fastq.gz
fastqs_consolidated/JEJ01-OMH_Pcan1_read2.fastq.gz
fastqs_consolidated/JEJ3-11-OMH_Pcan2_read1.fastq.gz
fastqs_consolidated/JEJ3-11-OMH_Pcan2_read2.fastq.gz
fastqs_consolidated/MERY3-OMH_Pcan3_read1.fastq.gz
fastqs_consolidated/MERY3-OMH_Pcan3_read2.fastq.gz
fastqs_consolidated/ANAL10-OMH_Pper1_read1.fastq.gz
fastqs_consolidated/ANAL10-OMH_Pper1_read2.fastq.gz
fastqs_consolidated/TOBI5-1-OMH_Pper2_read1.fastq.gz
fastqs_consolidated/TOBI5-1-OMH_Pper2_read2.fastq.gz
fastqs_consolidated/TOBI5-3-OMH_Pper3_read1.fastq.gz
fastqs_consolidated/TOBI5-3-OMH_Pper3_read2.fastq.gz
fastqs_consolidated/DAR4-11-OMH_Ptat1_read1.fastq.gz
fastqs_consolidated/DAR4-11-OMH_Ptat1_read2.fastq.gz
fastqs_consolidated/DAR4-39-OMH_Ptat2_read1.fastq.gz
fastqs_consolidated/DAR4-39-OMH_Ptat2_read2.fastq.gz
fastqs_consolidated/DAR4-5-OMH_Ptat3_read1.fastq.gz
fastqs_consolidated/DAR4-5-OMH_Ptat3_read2.fastq.gz
fastqs_consolidated/JAM4-16-OMH_Pcor1_read1.fastq.gz
fastqs_consolidated/JAM4-16-OMH_Pcor1_read2.fastq.gz
fastqs_consolidated/JAM4-20-OMH_Pcor2_read1.fastq.gz
fastqs_consolidated/JAM4-20-OMH_Pcor2_read2.fastq.gz
fastqs_consolidated/JAM4-7-OMH_Pcor3_read1.fastq.gz
fastqs_consolidated/JAM4-7-OMH_Pcor3_read2.fastq.gz
fastqs_consolidated/KIBO15-OMH_Pdec1_read1.fastq.gz
fastqs_consolidated/KIBO15-OMH_Pdec1_read2.fastq.gz
fastqs_consolidated/KIBO36-OMH_Pdec2_read1.fastq.gz
fastqs_consolidated/KIBO36-OMH_Pdec2_read2.fastq.gz
fastqs_consolidated/KIBO44-OMH_Pdec3_read1.fastq.gz
fastqs_consolidated/KIBO44-OMH_Pdec3_read2.fastq.gz
fastqs_consolidated/KMTEA7-10-OMH_Pver1_read1.fastq.gz
fastqs_consolidated/KMTEA7-10-OMH_Pver1_read2.fastq.gz
fastqs_consolidated/KMTEA7-2-OMH_Pver2_read1.fastq.gz
fastqs_consolidated/KMTEA7-2-OMH_Pver2_read2.fastq.gz
fastqs_consolidated/KMTEA7-4-OMH_Pver3_read1.fastq.gz
fastqs_consolidated/KMTEA7-4-OMH_Pver3_read2.fastq.gz
fastqs_consolidated/DASI5-08-OMH_Alang1_read1.fastq.gz
fastqs_consolidated/DASI5-08-OMH_Alang1_read2.fastq.gz
fastqs_consolidated/DASI5-16-OMH_Alang2_read1.fastq.gz
fastqs_consolidated/DASI5-16-OMH_Alang2_read2.fastq.gz
fastqs_consolidated/DASI5-21-OMH_Alang3_read1.fastq.gz
fastqs_consolidated/DASI5-21-OMH_Alang3_read2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job consolidate_fastqs since they might be corrupted:
fastqs_consolidated/SRR1657028_read1.fastq.gz, fastqs_consolidated/SRR1657028_read2.fastq.gz, fastqs_consolidated/SRR1657029_read1.fastq.gz, fastqs_consolidated/SRR1657029_read2.fastq.gz, fastqs_consolidated/SRR1575526_read1.fastq.gz,
And so on for the other SRA samples.
We’ve tried increasing the latency wait, to no avail. Moreover, we’ve monitored the fastqs_consolidated directory in real time, and the rule does produce working symlinks for all of the fastq files. Stranger still, Snakemake fails to recognize the outputs for the new samples, yet only deletes the symlinks for the SRA samples; it does this even if we create the symlinks manually. And despite the supposedly missing output all belonging to the new samples, working symlinks to those fastq files remain in the output directory (fastqs_consolidated).
We’ve tried various iterations of symlink commands, including running with and without the “-r” flag and with and without relative paths from the symlink to the original file (i.e., both ../{original} and {original}).
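For reference, here is the basename extraction from the run block checked in isolation, using one example file name from each set (the absolute path is a hypothetical stand-in for a file under fastq_directory):

```python
# Sanity check of the link-naming logic in the run block, applied to
# one example input path from each sample set.
sra_file = "renamed_fastqs/SRR1657028_fixed_1.fastq.gz"
new_file = "/path/to/fastqs/RANO330-OMH_Pedw1_read1.fastq.gz"  # hypothetical absolute path

print(sra_file.split("/")[-1].split("_")[0])  # SRR1657028
print(new_file.split("/")[-1].split("_")[0])  # RANO330-OMH
```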
Do you have any suggestions for what might be going wrong?
Upvotes: 1
Views: 245
Reputation: 9062
In my opinion, relying on a naming convention is unreliable and difficult to maintain. This is what I would do...
Prepare a sample sheet (a table) that links sample IDs to fastq files. In this table you can also record any per-sample information you need for processing. Read this file with pandas (better) or with your own csv/tsv parser; the information in the sample sheet will then instruct snakemake on how to proceed. Note that the sample names can be unrelated to the fastq file names, and local fastq files can live in any directory. I think this is a clean and flexible setup. For example, this is sample_sheet.tsv:
sample_id   sra_id       fastq_r1                                  fastq_r2
sample1     SRR1657028   sra/SRR1657028.fastq.R1.gz                sra/SRR1657028.fastq.R2.gz
sample2     NA           fastq_dir/RANO330-OMH_Pedw1.R1.fastq.gz   fastq_dir/RANO330-OMH_Pedw1.R2.fastq.gz
The snakefile:
import pandas

ss = pandas.read_csv('sample_sheet.tsv', sep='\t')

rule all:
    input:
        expand('vcf/{sample_id}.vcf', sample_id=ss.sample_id),

rule sra_download:
    output:
        'sra/{sra_id}.fastq.R1.gz',
        'sra/{sra_id}.fastq.R2.gz',
    shell:
        r"""
        fastq-dump --sra-id {wildcards.sra_id} ...
        # or better, query the ENA archive
        """

rule align:
    input:
        # Use the sample sheet to link sample_id to fastq files
        fq1 = lambda wc: ss[ss.sample_id == wc.sample_id].fastq_r1.iloc[0],
        fq2 = lambda wc: ss[ss.sample_id == wc.sample_id].fastq_r2.iloc[0],
        ref = 'genome.fa',
    output:
        bam = 'bam/{sample_id}.bam',
    shell:
        r"""
        bwa mem or whatever {input.ref} {input.fq1} {input.fq2} > {output.bam}
        """

rule call_variants:
    input:
        bam = 'bam/{sample_id}.bam',
    output:
        vcf = 'vcf/{sample_id}.vcf',
    shell:
        r"""
        call variants {input.bam} > {output.vcf}
        """
There is no need to rename files or rely on any naming scheme. If the fastq files already exist on the file system, snakemake will proceed to the alignment; otherwise, it will use rule sra_download to get them. I haven't tested this and there may be errors, but hopefully you get the idea.
Upvotes: 0