Reputation: 93
I'm using bcftools consensus
to extract haplotypes from a vcf file.
Given the input files:
A.sorted.bam
B.sorted.bam
The following output files are created:
A.hap1.fna
A.hap2.fna
B.hap1.fna
B.hap2.fna
I currently have two rules to do this. They differ only by the numbers 1 and 2 in the output files and shell command. Code:
rule consensus1:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap1.fna"
params:
sample="{sample}"
shell:
"bcftools consensus -i -s {params.sample} -H 1 -f {reference_file} {input.vcf} > {output}"
rule consensus2:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap2.fna"
params:
sample="{sample}"
shell:
"bcftools consensus -i -s {params.sample} -H 2 -f {reference_file} {input.vcf} > {output}"
While this code works, it seems that there should be a better, more pythonic way to do this using only one rule. Is it possible to collapse this into one rule, or is my current method the best way?
Upvotes: 1
Views: 1370
Reputation: 4089
Use wildcards for haplotypes 1 and 2 in rule all
. See here to learn more about adding targets via rule all
reference_file = "ref.txt"
rule all:
input:
expand("haplotypes/{sample}.hap{hap_no}.fna",
sample=["A", "B"], hap_no=["1", "2"])
rule consensus1:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap{hap_no}.fna"
params:
sample="{sample}",
hap_no="{hap_no}"
shell:
"bcftools consensus -i -s {params.sample} -H {params.hap_no} \
-f {reference_file} {input.vcf} > {output}"
Upvotes: 2