David M
David M

Reputation: 149

Snakemake: How do I use a function that takes in a wildcard and returns a value?

I have cram(bam) files that I want to split by read group. This requires reading the header and extracting the read group ids.

I have this function which does that in my Snakemake file:

def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)

I have this rule all:

rule all:
input:
    expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))

And this rule to actually do the split:

rule split_cram_by_rg:
input:
    cram_file='cram/{sample}.bam.cram',
    read_groups=identify_read_groups('cram/{sample}.bam.cram')
output:
    'cram/RG_bams/{sample}.RG{read_groups}.bam'
run:
    import subprocess
    read_groups = open(input.readGroupIDs).readlines()
    read_groups = [str(rg.replace('\n','')) for rg in read_groups]
    for rg in read_groups:
        command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)

I get this error when doing a dry run:

[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
  File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups

{sample} isn't being passed to the function.

How do I solve this problem? I'm open to other approaches if I'm not doing this in a 'snakemake-ic' way.

==============

EDIT 1

Ok, the first set of examples I gave had many many issues.

Here's a better (?) set of code, which I hope demonstrates my issue.

import sys
from os.path import join

shell.prefix("set -eo pipefail; ")

def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)

rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))

rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read groups values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'


rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' #not working
        """

==============

EDIT 2

Getting MUCH closer, but currently struggling with expand properly building the lane bam files and keeping the wildcards

I'm using this loop to create the intermediate file names:

for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))

return_ID is a function which takes the sample wildcard and returns a list of the read groups the sample contains

If I use out_rg_bam as an input for a merge rule, then ALL of the files get combined into a merged bam, instead of being split by sample.

If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)) then rg_id gets applied to each sample. So if I have two samples (a,b) , with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.

Upvotes: 4

Views: 11111

Answers (3)

Colin
Colin

Reputation: 10820

I'm going to give a more general answer to help others that might find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are directly listed, e.g.:

input:
    '{sample}.bam'

If you are trying to use functions like you were here:

input:
    read_groups=identify_read_groups('cram/{sample}.bam.cram')

The wildcard replacement will not be done. You can use a lambda function and do the replacement yourself:

input:
    read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))

Edit Feb 28, 2023: Of course now python has f-strings, which improves the syntax:

input:
    read_groups=lambda wildcards: identify_read_groups(f'cram/{wildcards.sample}.bam.cram')

Upvotes: 3

crazyhottommy
crazyhottommy

Reputation: 310

try this: I use id = 0, 1, 2, 3 to name the output bam file depending on how many readgroup for a bam file.

## this is a regular function which takes the cram file, and get the read-group to 
## construct your rule all
## you actually just need the number of @RG, below can be simplified  
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)

outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        outbam.append("{}.RG{}.bam".format(sample, id))


rule all:
    input:
        outbam

## this is the input function, only uses wildcards as argument 
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups[wildcards.id])

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'  
    run:
        import subprocess
        read_groups = input.read_groups
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)

when use snakemake, think the way bottom up. First define what you want to generate in the rule all, and then construct the rule to create your final all.

Upvotes: 0

TBoyarski
TBoyarski

Reputation: 441

Your all rule cannot have wildcards. It's a no wildcard-zone.

EDIT 1

I typed this pseudo-code in Notepad++, its not meant to compile, just trying to provide a framework. I think this is more what you are after.

Use a function inside of an expand to generate a list of file names which will then be used to driver the Snakemake pipeline's all rule. The baseSuffix and basePrefix variables are just to give you an idea as to String passing, arguments are permitted here. When passing back the list of strings, you will have to unpack them to ensure Snakemake reads the result properly.

def getSampleFileList(String basePrefix, String baseSuffix){
    myFileList = []
    ListOfSamples = *The wildcard glob call*
    for sample in ListOfSamples:
        command = "samtools -h " + sample + "SAME CALL USED TO GENERATE LIST OF HEADERS"
        for rg in command:
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
}


basePreix = "cram/RG_bams/"
baseSuffix = ".bam" 

rule all:
    input:
        unpack(expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix)))


rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"

END OF EDIT

If it wasn't in the all rule, you'd use inline functions

So I'm not sure what you're trying to accomplish. As per my guesses, read below for some notes about your code.

rule all:
input:
    expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))

The dry run is failing when it calls the function "identify_read_groups" inside the rule all call. It's being passed into your function call as a string, not a wildcard.

Technically, if the samtools call wasn't failing, and the function call "identify_read_groups(cram_file)" returned a list of 5 strings, it would expand to something like this:

rule all:
    input:
        'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'

But the term "{sample}", at this stage in Snakemake's pre-processing, is considered a string. As you needed to denote wildcards in an expand function with {{}}.

See how I address every Snakemake variable I declare for my rule all input call and don't use wildcards:

expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())

In this case read_table returns 2-dimensional array to form. Snakemake is well supported by python. I needed this for pairing of different annotations to different variant types.

Your rule all needs to be a string, or list of strings, as input. You cannot have wildcards in your "all" rule. These rule all input strings are what Snakemake uses to generate matches for OTHER wildcards. Build the entire filename in the function call and return it if you need to.

I think you should just turn it into something like this:

rule all:
input:
    expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))

Also consider updating this to be more generic.:

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')

It can have two or more wildcards (why we love Snakemake). You can access the wildcards later in the python "run" directive via the wildcards object, since it looks like you'll want to in your for each loop. I think input and output wildcards have to match, so maybe do try it this way as well.

rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
          rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' +  + str(params.rg)

Again, not super sure what you're trying to do, I'm not sure I like the idea of the function call twice, but hey, it would run ;P Also notice the use of a wildcard "sample" in the input directive within a string {} and in the output directive within an expand {{}}.

An example of accessing wildcards in your run directive

Example of function calls in places you wouldn't think. I grabbed VCF fields but it could have been anything. I use an external configfile here.

Upvotes: -1

Related Questions