Reputation: 33
I'm new to Snakemake and am running into some behavior I don't understand. I have a set of fastq files in the directory reads/raw_fastq, with file names following the standard Illumina convention:
SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz
I'd like to create symbolic links in the directory reads/renamed_raw_fastq that simplify the names to follow the pattern:
SAMPLENAME_READ.fastq.gz
My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly-added files.
My snakefile is as follows:
# Get sample names from read file names in the "raw" directory
readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'
import os
samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()

# Generate simplified names
readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'
newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])

# Create symlinks
import glob
def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName

rule all:
    input: newNames

rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    shell: "ln -sf {input} {output}"
When I run snakemake, it tries to generate the symlinks as expected, but it:
1. Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.
2. Throws errors like:
MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
It's almost like snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?
Thanks!
Upvotes: 3
Views: 158
Reputation: 9062
I think
ln -sf {input} {output}
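gives a dangling symlink, i.e., one that doesn't point to the source file. {input} is a path relative to the working directory, but a relative symlink target is resolved relative to the directory that contains the link, so the target is looked up under reads/renamed_raw_fastq/reads/raw_fastq/, which doesn't exist. That would explain both symptoms: a broken link looks like a missing output, so snakemake reruns the rule every time and then complains that the file is missing. You could fix it by, e.g., using absolute paths: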
def getRawName(wildcards):
    rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
    return rawName
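If you want to convince yourself of the difference outside of snakemake, here is a minimal sketch in plain Python. It recreates the two directories from the question in a temporary folder and makes the same link twice, once with a relative target and once with an absolute one. The helper names (link_resolves, demo) and the raw file name Ben21_S1_L001_R2_001.fastq.gz are made up for illustration; only the directory layout comes from the question.

```python
import os
import tempfile


def link_resolves(target, link_path):
    """Create a symlink at link_path pointing to target (replacing any
    existing link) and report whether the link resolves to a real file."""
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(target, link_path)
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return os.path.exists(link_path)


def demo():
    """Return (relative_ok, absolute_ok) for the layout in the question."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            os.makedirs("reads/raw_fastq")
            os.makedirs("reads/renamed_raw_fastq")
            # Hypothetical raw file name following the Illumina convention.
            src = "reads/raw_fastq/Ben21_S1_L001_R2_001.fastq.gz"
            open(src, "w").close()
            dst = "reads/renamed_raw_fastq/Ben21_R2.fastq.gz"
            # Same as `ln -sf {input} {output}` with a relative input path.
            relative_ok = link_resolves(src, dst)
            # Same link, but with the target made absolute first.
            absolute_ok = link_resolves(os.path.abspath(src), dst)
        finally:
            os.chdir(start)
    return relative_ok, absolute_ok


if __name__ == "__main__":
    print(demo())
```

On a Linux or macOS filesystem this should print (False, True): the relative link is created but does not resolve, while the absolute one does.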
(As an aside, I would make sure that renaming fastq files the way you do doesn't result in a name-collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
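A quick way to check for such collisions before linking is to group the raw file names by the simplified (sample, read) key and look for keys with more than one file. This is only a sketch: find_collisions is a hypothetical helper, it assumes the sample name is everything before the first underscore (as in the question's snakefile) and the read is the last field before _001.fastq.gz, and the file names in the example are invented.

```python
from collections import defaultdict

SUFFIX = "_001.fastq.gz"


def find_collisions(raw_names):
    """Map each (sample, read) key to its raw files, keeping only keys
    that more than one raw file would be renamed to."""
    groups = defaultdict(list)
    for name in raw_names:
        if not name.endswith(SUFFIX):
            continue  # not an Illumina-style fastq name, skip it
        fields = name[: -len(SUFFIX)].split("_")
        if len(fields) < 2:
            continue  # no read field to extract
        sample, read = fields[0], fields[-1]
        groups[(sample, read)].append(name)
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    # Hypothetical listing: the same sample sequenced on two lanes.
    names = [
        "Ben21_S1_L001_R1_001.fastq.gz",
        "Ben21_S1_L002_R1_001.fastq.gz",
        "Ben21_S1_L001_R2_001.fastq.gz",
    ]
    print(find_collisions(names))
```

In the snakefile you could call it on os.listdir(readRawDir) and stop if the result is non-empty. Note that with the glob(...)[0] in getRawName, a collision would otherwise go unnoticed, since only the first match is linked.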
Upvotes: 2