snakemake - how to make a list of input files based on a previous rule that produces variable number of files

Question

Say, I am starting with a bunch of files like these:

group_1_in.txt, group_2_in.txt, group_3_in.txt

I process them using a rule that generates the directory structure shown below.

rule process_group_files:
    input: 'group_{num}_in.txt'
    output: directory('group_{num}')
    shell: "some_command {input} {output}'

## directory structure produced: 
group_1
    sample1_content.txt
    sample2_content.txt
    sample3_content.txt
group_2
    sample2_content.txt
    sample3_content.txt
    sample4_content.txt
group_3
    sample1_content.txt
    sample2_content.txt
    sample5_content.txt

Then, I have rule that processes them to aggregate files by sample:

rule aggregate_by_sample:
    input: expand('{group}/{sample}_content.txt')
    output: '{sample}_allcontent.txt'
    shell: "cat {input} | some_command > {output}"

I expect the inputs for this rule to be:

group_1/sample1_content.txt, group_3/sample1_content.txt
group_1/sample2_content.txt, group_2/sample2_content.txt, group_3/sample2_content.txt
group_1/sample3_content.txt, group_2/sample3_content.txt
group_2/sample4_content.txt 
group_3/sample5_content.txt

and produce the following output files:

sample1_allcontent.txt
sample2_allcontent.txt
sample3_allcontent.txt
sample4_allcontent.txt
sample5_allcontent.txt

At this point, I want to work with these output files. So, the rule for this can be something like:

rule process_by_sample:
    input: 
    output: final_output.txt 
    shell: "cat {input} | some_other_command > {output}"

My question is this: how can I tell snakemake to wait until it has finished processing all of the files in aggregate_by_sample rule, then use that set of output files for the rule process_by_sample? I explored the idea of checkpoints by making aggregate_by_sample a checkpoint but I should be using a 'directory' as output since I don't know apriori how many output files will be produced. But I cannot do that because my output file names use wildcards and snakemake complains that Wildcards in input files cannot be determined from output files.

EDIT -- After seeing the answer by @troy-comi I realized that I oversimplified my issue. I updated my question to include the first rule process_group_files. All I know at the beginning of the pipeline is how many groups I have and what the 'number' wildcard list is.

snakemake - how to make a list of input files based on a previous rule that produces variable number of files

Answers (1)

Related Questions