Getting wildcard from input files when not used in output files

Question

I have a Snakemake rule that aggregates several result files into a single file per study. To make this more concrete: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (that is, merges the phenotypes into a single table). In the merge_results rule I pass a list of files (per study and role), aggregate them using a pandas DataFrame, and write the result out as a single file.

While merging the results I need the 'pheno' value of the input file currently being iterated over. Since pheno is not needed in the name of the aggregated output file, it does not appear in output, and as a consequence it is not available in the wildcards object. To get hold of the pheno I currently parse the file name, but this feels very hacky and I suspect there is something I have not understood properly. Is there a better way to get wildcards from input files when they are not used in the output files?

runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole  = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os

        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')

            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno

            tmplist.append(data)

        resmerged = pd.concat(tmplist)

        # `output` is a list-like object inside the rule, so pass its first entry as the path
        resmerged.to_csv(output[0], sep='\t')
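
One possible way to avoid parsing file names is to rely on ordering rather than on wildcards. This is a plain-Python sketch, not Snakemake code, and it assumes that expand() returns the input files in the same order as runpheno, so that each file can be paired with its phenotype by position. The names outdir, summary_paths and pair_with_pheno are hypothetical and exist only for this illustration; outdir stands in for the directory variable called output in the Snakefile.

```python
import os

# Hypothetical values mirroring the Snakefile above.
runpheno = ['xxx', 'yyy', 'zzz']
outdir = 'results'  # stands in for the Snakefile's `output` directory variable


def summary_paths(role, study, phenos):
    """Build one input path per phenotype, in the same order as `phenos`."""
    return [os.path.join(outdir, role, study, p, p + '.summary') for p in phenos]


def pair_with_pheno(paths, phenos):
    """Pair each input path with the phenotype at the same position."""
    if len(paths) != len(phenos):
        raise ValueError('expected exactly one input file per phenotype')
    return list(zip(phenos, paths))


pairs = pair_with_pheno(summary_paths('big', 'a', runpheno), runpheno)
for pheno, path in pairs:
    # In the real rule, this is where the file would be read and tagged with pheno.
    print(pheno, path)
```

Inside the run block, the equivalent would presumably be a loop such as `for pheno, f in zip(runpheno, input):`, which removes the need to split the file name. This only holds as long as the input list is built from runpheno alone and in that order.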
