hena

Reputation: 11

Running a snakemake pipeline for multiple datasets

I have a snakemake pipeline with rules that call other programs and custom R and Python scripts.

I have multiple datasets on which this same pipeline needs to run. Usually I would make a separate folder for each dataset, put a dataset-specific config file in it, and run the pipeline on each one individually.

As I have 20+ datasets this time, I was wondering if there is a more automated way to do this. Four main parameters change between the datasets: input file location, primer, quality control parameter and output dir for results. Is there a way to have a 'master' config file holding these 4 parameters for every dataset, and a snakefile which then calls the second snakefile once per dataset?

The whole idea looks to me like a for loop over arrays of these 4 parameters, but I can't figure out how to implement it in snakemake.

Any suggestions and ideas are welcome! Thanks, Hena

Upvotes: 1

Views: 621

Answers (1)

bli

Reputation: 8194

Provided all the parameters are somehow "encoded" in the output file names, I think this can be done using a single snakefile.

Your main configuration file would include a section for each dataset, and this section could contain the desired output directory as well as a path to a configuration file specific to this dataset.

Proof of concept:

Snakefile:

import yaml

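# Each top-level key of the main config (passed with --configfile) is one dataset.
datasets = list(config.keys())

# Build the list of final output files, one per dataset.
results = []
for dataset in datasets:
    out_dir = config[dataset]["out_dir"]
    # Load the configuration file specific to this dataset.
    with open(config[dataset]["conf"]) as conf_fh:
        dat_conf = yaml.safe_load(conf_fh)
        p1 = dat_conf["p1"]
        p2 = dat_conf["p2"]
        p3 = dat_conf["p3"]
        p4 = dat_conf["p4"]
    # The parameters are encoded in the output file name.
    results.append(f"{out_dir}/{p1}_{p2}_{p3}_{p4}.out")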


rule all:
    input:
        results


rule make_output:
    output:
        "{out_dir}/{p1}_{p2}_{p3}_{p4}.out"
    shell:
        "touch {output[0]}"

main_config.yaml:

dat1:
    out_dir: "dat1"
    conf: "dat1_conf.yaml"
dat2:
    out_dir: "dat2"
    conf: "dat2_conf.yaml"

dat1_conf.yaml:

p1: "A"
p2: "a"
p3: "1"
p4: "01"

dat2_conf.yaml:

p1: "B"
p2: "b"
p3: "2"
p4: "02"

This can be executed, for instance, as follows:

snakemake --snakefile Snakefile --configfile main_config.yaml -j 2

This creates the following result files:

dat1/A_a_1_01.out
dat2/B_b_2_02.out
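
If you would rather keep the four parameters from the question in a single master config instead of one extra file per dataset, the target list can be built the same way. The following is only a plain-Python sketch of that idea: the dictionary stands in for snakemake's `config`, and the keys `input`, `primer`, `qc` and `out_dir`, the example values, and the helper names `build_targets` and `params_for` are all hypothetical.

```python
# Hypothetical master config: one entry per dataset, holding the four
# parameters named in the question. All keys and values are illustrative.
master_config = {
    "dat1": {"input": "raw/dat1.fastq", "primer": "ACGT", "qc": 20, "out_dir": "results/dat1"},
    "dat2": {"input": "raw/dat2.fastq", "primer": "TTGA", "qc": 30, "out_dir": "results/dat2"},
}


def build_targets(config):
    """Return one final output path per dataset, named after the dataset."""
    return [f"{params['out_dir']}/{name}.out" for name, params in config.items()]


def params_for(config, dataset):
    """Look up the parameters of one dataset by its name."""
    return config[dataset]


targets = build_targets(master_config)
print(targets)
```

In a Snakefile, `targets` would play the role of `results` in `rule all`, and a lookup like `params_for` could be done from a dataset wildcard, so that the parameters no longer need to appear in the file names. I have not tested this variant inside snakemake.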

Upvotes: 1
