Reputation: 27
I am parsing a multi-fasta file into single fasta file and I want to create wildcards for each file because the next rule needs to be parallelized for each file. My problem is that I am not able to create a wildcard from the resulting fasta file because the output changes dynamicaly depending on the multi-fasta file I have. Here is my code:
rule all:
input:
final = "kmers/{sample}/results.fasta",
merge = "merge.fasta",
arc = "All_fasta/"
checkpoint pyt:
input:
mm = "fasta.fasta"
output:
arc = directory("All_fasta/")
script:
"pyt.py"
def some_func(wildcards):
checkpoint_in = checkpoints.pyt.get(**wildcards).output[0]
# I don't know what to do here
return "{sample}"
rule not_working:
input:
new_input = "{sample}"
output:
final = "kmers/{sample}/results.fasta"
shell:
"somecmd {input} > {output}"
rule merge:
input:
final = expand("kmers/{sample}/results.fasta",sample=sample)
output:
merge = "merge.fasta"
shell:
"cat {input} > {ouput}"
In brief, I want to create a wildcard for each fasta entry I have in my multi-fasta input file.
thank you!
Upvotes: 0
Views: 568
Reputation: 9062
I think this is what you want...
Input file fasta.fasta
is:
cat fasta.fasta
>a
AAAA
>b
CCCC
>c
TTTT
sequence_names
produced inside aggregate_input
is the list of sequence names from the original fasta file.
import glob
rule all:
input:
'merge.fasta',
checkpoint pyt:
input:
"fasta.fasta"
output:
all_fasta= directory("All_fasta")
shell:
# Split multifasta into single files. I assume this is what pyt.py
# does. This is just a dummy script
r"""
mkdir {output}
grep -A 1 '>a' {input} > {output}/a.fasta
grep -A 1 '>b' {input} > {output}/b.fasta
grep -A 1 '>c' {input} > {output}/c.fasta
"""
rule process_sequences:
input:
new_input= 'All_fasta/{sequence_name}.fasta',
output:
final= 'kmers/{sequence_name}/results.fasta',
shell:
"cp {input} {output}"
def aggregate_input(wildcards):
checkpoint_output = checkpoints.pyt.get().output['all_fasta']
fasta_files = sorted(glob.glob(os.path.join(checkpoint_output, '*.fasta')))
sequence_names = [re.sub('\.fasta$', '', os.path.basename(x)) for x in fasta_files]
return expand('kmers/{sequence_name}/results.fasta', sequence_name= sequence_names)
rule merge:
input:
final = aggregate_input,
output:
merge = "merge.fasta"
shell:
"cat {input} > {output}"
I avoided the use of glob_wildcards
since I find it too cryptic. I prefer to collect the output files using glob.glob
and parse them using python built-ins.
Upvotes: 1