helgis

Reputation: 85

Getting wildcard from input files when not used in output files

I have a Snakemake rule that aggregates several result files into a single file per study. To make this more concrete: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate them using a pandas data frame, and write the result out as a single file.

While merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not part of output, and as a consequence it is also not available in the wildcards object. To get hold of pheno I currently parse the filename, but this feels very hacky and I suspect there is something I have not understood properly. Is there a better way to get wildcards from input files when they are not used in the output files?

runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole  = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os

        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')

            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno

            tmplist.append(data)

        resmerged = pd.concat(tmplist)

        # inside run:, 'output' is the rule's list of output files, so pass its first entry
        resmerged.to_csv(output[0], sep='\t')

Upvotes: 1

Views: 275

Answers (1)

Eric C.

Reputation: 3368

You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
role and study are wildcards: the doubled braces keep them as placeholders for Snakemake to fill in. pheno is not a wildcard; its values are set by the pheno=runpheno keyword argument passed to the expand function.
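
As a rough illustration of why the braces are doubled, the same escaping can be reproduced with Python's built-in str.format. This is only a sketch with plain string formatting, not Snakemake's own expand, and the template string here is made up for the example.

```python
# Plain-Python sketch of the brace escaping (not Snakemake's expand itself).
runpheno = ['xxx', 'yyy', 'zzz']

# Doubled braces survive formatting as single braces, so {role} and {study}
# stay behind as placeholders; {pheno} is filled in from the list.
template = '{{role}}/{{study}}/{pheno}/{pheno}.summary'
patterns = [template.format(pheno=p) for p in runpheno]

for p in patterns:
    print(p)
```

Each resulting pattern still contains {role} and {study}, which is what Snakemake later matches against the requested output file.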

To get the phenotype in your for loop, you can either parse the file name as you are doing, or reconstruct the file name directly, since you know the values that pheno takes and you can access the wildcards:

run:
    import pandas as pd
    import os

    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:

        # The global variable 'output' clashes with the rule's own 'output' here,
        # so the global is assumed to be renamed, e.g. to outputDir.
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno+'.summary')

        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno

        tmplist.append(data)

    resmerged = pd.concat(tmplist)

    # 'output' is the rule's list of output files, so pass its first entry
    resmerged.to_csv(output[0], sep='\t')

I don't know whether this is better than parsing the file name as you were doing, though. I mainly wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
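
Two other ways to label each file could be sketched as follows. The first pairs files with phenotypes by position, which assumes the input list comes back in the same order as runpheno (this should hold when a single variable is expanded). The second reads the phenotype from the parent directory name, which assumes the .../{pheno}/{pheno}.summary layout from the question. The 'results' directory, input_files and pheno_from_path are hypothetical names used only for this illustration.

```python
from pathlib import Path

runpheno = ['xxx', 'yyy', 'zzz']

# Stand-in for the rule's `input` for role='big', study='a' (hypothetical paths).
input_files = [str(Path('results', 'big', 'a', p, p + '.summary')) for p in runpheno]

def pheno_from_path(path):
    """Read the phenotype from the parent directory name.

    Assumes the layout .../{pheno}/{pheno}.summary.
    """
    return Path(path).parent.name

# Option 1: pair by position, relying on the files being listed in runpheno order.
paired = list(zip(runpheno, input_files))

# Option 2: derive the label from the path, which does not depend on ordering.
derived = [(pheno_from_path(f), f) for f in input_files]

for pheno, f in paired:
    print(pheno, f)
```

Inside the rule, the first option would turn the loop into `for pheno, f in zip(runpheno, input):`, keeping the declared input files as the only source of paths.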

Upvotes: 1
