Damilola Decarls

Reputation: 13

Is there a function in Snakemake to make the list of outputs depend on the arguments passed to the shell command?

I have a Snakemake rule that calls a Python program, and the files that program writes depend on the arguments passed to it. I would like Snakemake to know how the expected output differs when a certain Boolean parameter is passed and when it is not.

My current solution is to build a list of outputs, list_phen_gen_output, that depends on the configuration of the arguments. However, the number of branches grows exponentially once the source program takes 3 arguments that each alter the list of outputs it produces.

Here is my current solution for one of the arguments, covering the case where extract_genotypes == "T" and the case where it is not:

if extract_genotypes == "T":
    list_phen_gen_output = [f"{output_dir}phen_{breed}.txt",
        f"{output_dir}non_phenotyped_{breed}.txt",
        f"{output_dir}listcodeall{breed}.txt",
        f"{output_dir}genotypes_{breed}.txt"]
else:
    list_phen_gen_output = [f"{output_dir}phen_{breed}.txt",
        f"{output_dir}non_phenotyped_{breed}.txt",
        f"{output_dir}listcodeall{breed}.txt"]

rule create_phen_gen:
    input:
        f"{output_dir}/ZW.{breed_code}.fwf",
        f"{output_dir}/all_phenotypes.fwf",
    output:
        list_phen_gen_output
    log:
        f"{output_dir}SNAKEMAKE_{breed_code}.log"
    shell:
        f"python {SOURCE}wr_workflow.py {code} {YYMM_S} {extract_genotypes} {run_validation} {post_2000} {val_folder}"

How can I make snakemake outputs dependent on the input parameters of the source program?

Upvotes: 1

Views: 501

Answers (2)

Troy Comi

Reputation: 2059

I don't think there is a way to use functions as output files in Snakemake. Your rule is currently specialized to a single breed, but if you want to extend it to multiple breeds you will likely need checkpoints instead. The basic setup is to make create_phen_gen a checkpoint and declare the parent folder as a directory output; the "consuming" rule then inspects that folder to decide what to do.
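As a rough illustration of the "check the output" step, here is a plain-Python sketch of a helper that reports which of the expected files exist in a folder. The function name produced_files and the CANDIDATE_TEMPLATES list are hypothetical; the file-name patterns are the ones from the question. In a Snakefile, something like this would typically be called from an input function once checkpoints.create_phen_gen.get(...) has resolved, but that wiring is not shown here.

```python
from pathlib import Path

# Hypothetical list of file-name patterns the program may write for a breed
# (patterns copied from the question).
CANDIDATE_TEMPLATES = [
    "phen_{breed}.txt",
    "non_phenotyped_{breed}.txt",
    "listcodeall{breed}.txt",
    "genotypes_{breed}.txt",
]


def produced_files(folder, breed):
    """Return the candidate files that actually exist in `folder` for `breed`."""
    folder = Path(folder)
    found = []
    for template in CANDIDATE_TEMPLATES:
        path = folder / template.format(breed=breed)
        if path.is_file():
            found.append(str(path))
    return found
```

The consuming rule can then request only the files this helper returns, so it never has to know which flags were passed to the program.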

For your current setup (which is fine for a single breed) you can reduce the logic and duplication somewhat. I'm assuming that a flag set to "T" means an additional file will be present in the outputs:

list_phen_gen_output = [
    f"{output_dir}phen_{breed}.txt",
    f"{output_dir}non_phenotyped_{breed}.txt",
    f"{output_dir}listcodeall{breed}.txt",
]
if extract_genotypes == "T":
    list_phen_gen_output.append(f"{output_dir}genotypes_{breed}.txt")

if OTHER_THING == "T":
    list_phen_gen_output.append(f"{output_dir}OTHER_{breed}.txt")
else:
    list_phen_gen_output.append(f"{output_dir}ALT_{breed}.txt")

This way the code grows only linearly with the number of options.
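The same idea can be wrapped in a small function so that each option adds one table entry instead of one if block. This is only a sketch: build_outputs, BASE_TEMPLATES and OPTIONAL_TEMPLATES are names made up for illustration, and only the extract_genotypes entry comes from the question.

```python
# Files that are always written (names taken from the question).
BASE_TEMPLATES = [
    "phen_{breed}.txt",
    "non_phenotyped_{breed}.txt",
    "listcodeall{breed}.txt",
]

# Extra files gated by a flag. Only extract_genotypes comes from the question;
# further options would be added here, one entry each.
OPTIONAL_TEMPLATES = {
    "extract_genotypes": ["genotypes_{breed}.txt"],
}


def build_outputs(output_dir, breed, flags):
    """Build the output list for one breed from a dict of "T"/"F" flags."""
    templates = list(BASE_TEMPLATES)
    for name, extra in OPTIONAL_TEMPLATES.items():
        if flags.get(name) == "T":
            templates.extend(extra)
    return [f"{output_dir}{t.format(breed=breed)}" for t in templates]
```

In the Snakefile, list_phen_gen_output could then be assigned from a single call such as build_outputs(output_dir, breed, {"extract_genotypes": extract_genotypes}).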

Upvotes: 1

SultanOrazbayev

Reputation: 16551

One solution is to define a table/dataframe that lists combinations of variables for which the file should be generated. This could look like this:

from io import StringIO

from pandas import read_csv

# table of files and the variable combination that produces each one
csv_file = StringIO("""
file_name,var1,var2,var3
abc.txt,T,T,F
def.txt,F,T,F
ghi.txt,F,F,F
""")

df = read_csv(csv_file)

# filter the files that satisfy your criteria
# assuming that var1, var2, var3 are defined
mask = (df['var1'] == var1) & (df['var2'] == var2) & (df['var3'] == var3)
list_phen_gen_output = df.loc[mask, 'file_name'].tolist()
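If you would rather not depend on pandas inside the Snakefile, the same lookup can be sketched with the standard-library csv module. The helper name select_files is hypothetical, and the table is the placeholder one from above rather than real file names.

```python
import csv
from io import StringIO

# Same placeholder table as above: one row per file and the flag
# combination that produces it.
TABLE = """file_name,var1,var2,var3
abc.txt,T,T,F
def.txt,F,T,F
ghi.txt,F,F,F
"""


def select_files(table_text, **wanted):
    """Return file names whose row matches every requested column value."""
    rows = csv.DictReader(StringIO(table_text))
    return [
        row["file_name"]
        for row in rows
        if all(row[key] == value for key, value in wanted.items())
    ]
```

For example, select_files(TABLE, var1="F", var2="T", var3="F") returns ["def.txt"]. Leaving a column out of the call means it is not filtered on.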

Upvotes: 1
