Reputation: 307
I have a functioning Snakefile for Partitioned Heritability using multiple bed files. This produces a perfect list of jobs using snakemake -np
so this file only needs a minor tweak (I hope!).
My issue occurs in the merge_peaks
rule below.
At this stage, I have 25 bed files and need to run 5 calls of merge_peaks
, one call for each extension ext=[100,200,300,400,500]
, so I need only the 5 bed files containing the relevant extension to be called each time.
For example for the following merge_peaks
output file peak_files/Fullard2018_peaks.mrgd.blrm.100.bed
, I only need the following 5 bed files with ext=100
to be used as input:
peak_files/fullard2018_NpfcATAC_1.blrm.100.bed
peak_files/fullard2018_NpfcATAC_2.blrm.100.bed
peak_files/fullard2018_NpfcATAC_3.blrm.100.bed
peak_files/fullard2018_NpfcATAC_4.blrm.100.bed
peak_files/fullard2018_NpfcATAC_5.blrm.100.bed
Here is my config file:
samples:
fullard2018_NpfcATAC_1:
fullard2018_NpfcATAC_2:
fullard2018_NpfcATAC_3:
fullard2018_NpfcATAC_4:
fullard2018_NpfcATAC_5:
index: /home/genomes_and_index_files/hg19.chrom.sizes
Here is the Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("peak_files/{sample}.blrm.{ext}.bed", sample=config['samples'], ext=[100,200,300,400,500]),
expand("LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz", sample=config['samples'], ext=[100,200,300,400,500], chr=range(1,23))
rule annot2bed:
input:
folder = "Reference/baseline"
params:
file = "Reference/baseline/baseline.{chr}.annot.gz"
output:
"LD_annotation_files/baseline.{chr}_no_head.bed"
shell:
"zcat {params.file} | tail -n +2 |awk -v OFS=\"\t\" '{{print \"chr\"$1, $2-1, $2, $3, $4}}' "
"| sort -k1,1 -k2,2n > {output}"
rule extend_bed:
input:
"peak_files/{sample}_peaks.blrm.narrowPeak"
output:
"peak_files/{sample}.blrm.{ext}.bed"
params:
ext = "{ext}",
index = config["index"]
shell:
"bedtools slop -i {input} -g {params.index} -b {params.ext} > {output}"
rule merge_peaks:
input:
expand("peak_files/{sample}.blrm.{ext}.bed", sample = config['samples'], ext=[100,200,300,400,500])
output:
"peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed"
shell:
"cat {input} | bedtools sort -i stdin | bedtools merge -i stdin > {output}"
rule intersect_mybed:
input:
annot = rules.annot2bed.output,
mybed = rules.merge_peaks.output
output:
"LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz"
params:
uncompressed = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot"
shell:
"echo -e \"CHR\tBP\tSNP\tCM\tANN\" > {params.uncompressed}; "
"/share/apps/bedtools intersect -a {input.annot} -b {input.mybed} -c | "
"sed 's/^chr//g' | awk -v OFS=\"\t\" '{{print $1, $2, $4, $5, $6}}' >> {params.uncompressed}; "
"gzip {params.uncompressed}"
rule ldsr:
input:
annot = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz",
bfile_folder = "Reference/1000G_plinkfiles",
snps_folder = "Reference/hapmap3_snps"
output:
"LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz"
conda:
"envs/p2-ldscore.yaml"
params:
bfile = "Reference/1000G_plinkfiles/1000G.mac5eur.{chr}",
ldscores = "LD_annotation_files/Fullard2018.{ext}.{chr}",
snps = "Reference/hapmap3_snps/hm.{chr}.snp"
log:
"logs/LDSC/Fullard2018.{ext}.{chr}_ldsc.txt"
shell:
"export MKL_NUM_THREADS=2;" # Export arguments are workaround as ldsr uses all available cores
"export NUMEXPR_NUM_THREADS=2;" # Numbers must match cores parameter in cluster config
"Reference/ldsc/ldsc.py --l2 --bfile {params.bfile} --ld-wind-cm 1 "
"--annot {input.annot} --out {params.ldscores} --print-snps {params.snps} 2> {log}"
What is currently happening is all 25 bed files are being fed to the merge peaks rule for each call, whereas I only need the 5 with the relevant extension to be fed in each time. I'm struggling to work out how to use the expand function correctly, or an alternate method, to include and merge only the relevant bed files in each call of the rule.
This question asks something similar, I think, but is not quite what I'm looking for as it doesn't use a config file - Snakemake: rule for using many inputs for one output with multiple sub-groups
I love Snakemake but my python is a bit dicey.
Many Thanks.
Upvotes: 1
Views: 1286
Reputation: 307
I have a solution for this using double curly brackets as an escape character. This works and gives me what I need, but there is still something going on that I don't understand.
So for the input line to the merge_peaks
rule instead of:
expand("peak_files/{sample}.blrm.{ext}.bed", sample = config['samples'], ext=[100,200,300,400,500])
I put {{}}
around the ext
like so:
expand("peak_files/{sample}.blrm.{{ext}}.bed", sample = config['samples'], ext=[100,200,300,400,500])
This escapes ext
from the expand statement so that I only get files containing a single extension value in the input statement for each call to merge_peaks
. However, instead of 5 files being passed to the input statement, one file per sample for a particular extension (see question), I get five copies of the same file being passed to the input:
cat
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_1.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_2.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_3.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_4.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
peak_files/fullard2018_NpfcATAC_5.blrm.500.bed \
| bedtools sort -i stdin | bedtools merge -i stdin > peak_files/Fullard2018_peaks.mrgd.blrm.500.bed`
Ultimately this has no bearing on the solution as it provides the output I need to move forward, but means I'm merging identical files unnecessarily. So there must be a better answer to the problem than this.
Upvotes: 0
Reputation: 9062
If I understand correctly, you managed to create one file per sample per extension (25 files in total) and now you want to merge files having the same extension. So your required final output should be:
peak_files/Fullard2018_peaks.mrgd.blrm.100.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.200.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.300.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.400.bed,
peak_files/Fullard2018_peaks.mrgd.blrm.500.bed
(For testing I create the 25 input files to be merged by extension):
mkdir -p peak_files
for i in 100 200 300 400 500
do
touch peak_files/fullard2018_NpfcATAC_1.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_2.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_3.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_4.blrm.${i}.bed
touch peak_files/fullard2018_NpfcATAC_5.blrm.${i}.bed
done
This snakefile should do the job. Of course, you can move samples
and exts
to config entries:
samples= ['fullard2018_NpfcATAC_1',
'fullard2018_NpfcATAC_2',
'fullard2018_NpfcATAC_3',
'fullard2018_NpfcATAC_4',
'fullard2018_NpfcATAC_5']
exts= [100, 200, 300, 400, 500]
rule all:
input:
expand('peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed', ext= exts),
rule merge_peaks:
input:
lambda wildcards: expand('peak_files/{sample}.blrm.{ext}.bed',
sample= samples, ext= wildcards.ext),
output:
'peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed',
shell:
r"""
cat {input} > {output}
"""
The lambda
function in merge_peaks
is saying for each extension ext give me a list of files, one file for each sample in "samples"
Upvotes: 1