Reputation: 1070

Output files with mixed wildcards

I'm stuck on how to do this with Snakemake.

First, say my "all" rule is:

rule all:
    input: "SA.txt", "SA_T1.txt", "SA_T2.txt",
        "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"

Notice that SA has two _T# files while SB has three such files, a crucial element of this.

Now I want to write a rule like this to generate these files:

rule X:
    output: N="S{X}.txt", T="S{X}_T{Y}.txt"
    (etc.)

But SnakeMake requires that both output templates have the same wildcards, which these don't. Further, even if SnakeMake could handle the multiple wildcards, it would presumably want to find a single filename match for the S{X}_T{Y}.txt template, but I want that to match ALL files where {X} matches the first template's {X}, i.e. I want output.T to be a list, not a single file. So it would seem the way to do it is:

def myExpand(T, wildcards):
    T = T.replace("{X}", wildcards.X)
    T = [T.replace("{Y}", S) for S in theXYs[wildcards.X]]
    return T

rule X:
    output: N="S{X}.txt", T=lambda wildcards: myExpand("S{X}_T{Y}.txt", wildcards)
    (etc.)

But I can't do this, because a lambda function cannot be used in an output section.

How do I do it?

It seems to me this argues for supporting lambda functions on output statements, providing a wildcards dictionary filled with values from already-parsed sections of the output statement.

Additions responding to comments:

The value of wildcard Y is needed because other rules have inputs for those files that have the wildcard Y.

My rule knows the different values for Y (and X) that it needs to work with from data read from a database into python dictionaries.

There are many values for X, and from 2 to 6 values of Y for each value of X. I don't think it makes sense to use separate rules for each value of X. However, I might be wrong as I recently learned that one can put a rule inside a loop, and create multiple rules.

More info about the workflow: I am combining somatic variant VCF files for several tumor samples from one person together into a single VCF file, and doing it such that for each called variant in any one tumor, all tumors not calling that variant are analyzed to determine read depth at the variant, which is included in the merged VCF file.

The full process involves about 14 steps, which could perhaps be as many as 14 rules. I actually didn't want to use 14 rules, but preferred to just do it all in one rule.

However, I now think the solution is to indeed use lots of separate rules. I was avoiding this partly because of the large number of unnecessary intermediate files, but actually, these exist anyway, temporarily, within a single large rule. With multiple rules I can mark them temp() so Snakemake will delete them at the end.

For the sake of fleshing out this discussion, which I believe is a legitimate one, let's assume a simple situation that might arise. Say that for each of a number of persons, you have N (>=2) tumor VCF files, as I do, and that you want to write a rule that will produce N+1 output files, one output file per tumor plus one more file that is associated with the person. Use wildcard X for person ID and wildcard Y for tumor ID within person X. Say that the operation is to put all variants present in ALL tumor VCF files into the person output VCF file, and all OTHER variants into the corresponding tumor output files for the tumors in which they appear. Say a single program generates all N+1 files from the N input files. How do you write the rule?

You want this:

rule:
    output: COMMON="{X}.common.vcf", INDIV="{X}.{Y}.indiv.vcf"
    input: "{X}.{Y}.vcf"
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
        --common {output.COMMON} --indiv {output.INDIV}
        """

But that violates the rules for output wildcards.

The way I did it, which is less than satisfactory, but works, is to use two rules, the first of which has the output template with more wildcards, and the second the template with fewer wildcards, and having the second rule create temporary output files which are renamed to the final name by the first rule:

rule A:
    output: "{X}.{Y}.indiv.vcf"
    input: "{X}.common.vcf"
    run: "for infile in {input}: os.system('mv '+infile+'.tmp'+' '+infile)"

rule B:
    output: "{X}.common.vcf"
    input: lambda wildcards: \
        expand("{X}.{Y}.vcf", **wildcards, Y=getYfromDB(wildcards["X"]))
    params: OUT=lambda wildcards: \
        expand("{X}.{Y}.indiv.vcf.tmp", Y=getYfromDB(wildcards["X"]))
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
        --common {output} --indiv {params.OUT}
        """

Upvotes: 3

Answers (2)

bli

Reputation: 8194

My understanding is that snakemake works by taking steps that look as follows:

Look at the name of a file it is asked to generate.
Find the rule that has a matching pattern in its output.
Infer the values of the wildcards from the above match.
Use the wildcards to determine what the name of the inputs of the chosen rule should be.

Your rule can generate its output without needing an input, so the problem of inferring the value of wildcard Y is not evident.

How does your rule know how many different values for Y it needs to work with ?

If you find a way to determine the values for Y just knowing the value for X and predefined "python-level" functions and variables, then there may be a way to have Y as an internal variable of your rule, and not a wildcard.

In this case, the workflow could be driven only by files S{X}.txt. The S{X}_T{Y}.txt would just be a byproduct of the execution of the rule, not in its explicit output.

Upvotes: 0

The Unfun Cat

Reputation: 31948

I do not know how the rest of your workflow looks, and what the best solution is is context-dependent.

1

What about splitting up the rule into two, one creating "SA.txt", "SA_T1.txt", "SA_T2.txt" and another "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"?

2

Another possibility is to only have the {X} files in the output-directive, but have the rule create the other files, even though they are not in the output-directive. This does not work if the {Y} files are part of the DAG.

3 (Best solution?)

A third and potentially the best solution might be to have aggregated wildcards in the X rule and the rule that requires the output from X.

Then the solution would be

rule X:
    output: N="S{X_prime}.txt", T="S{Y_prime}.txt"

The rule which requires these files can look like:

rule all:
    input:
        expand("S{X_prime}", X_prime="A_T1 A_T2".split()),
        expand("S{Y_prime}", Y_prime="B_T1 B_T2 B_T3".split())

If this does not meet your requirements we can discuss it further :)

Ps. You might need to use wildcard_constraints to disambiguate the outputs of rule X.

list_of_all_valid_X_prime_values = "A_T1 A_T2".split()
list_of_all_valid_Y_prime_values = "B_T1 B_T2 B_T3".split()

wildcard_constraints:
    X_prime = "({})".format("|".join(list_of_all_valid_X_prime_values))
    Y_prime = "({})".format("|".join(list_of_all_valid_Y_prime_values))

rule all:
    ...

Upvotes: 0

Output files with mixed wildcards

Additions responding to comments:

Answers (2)

1

2

3 (Best solution?)

Related Questions