Giulio Centorame
Giulio Centorame

Reputation: 700

Use a shell script to obtain a params for every wildcard

I have an R script in my workflow that requires the number of entries on several csv files (a.csv, b.csv, c.csv; formatted with headers) as a value. Since they all have a string something in every line, I thought I could write the rule as follows:

configfile: config.yaml
WILDCARD = config['wildcard']
TEMP_DIR = "~/temp"

rule all:
    input:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"

rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        f"config['file']",
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        n_lines = shell(
            "grep -c something ../resources/{{WILDCARD}}.csv | xargs"
        )
    script:
        "scripts/use_lines.R"

config.yaml contains

wildcard:
   - a
   - b
   - c

and n_lines is called in R as snakemake@params$n_lines. The way the expansion is interpreted in shell(), though, is as grep -c something ../resources/a b c.csv, how do I get it to interpret the wildcards as e.g. grep -c something ../resources/a.csv and return the value to n_lines correctly?

Thanks in advance

Upvotes: 1

Views: 108

Answers (2)

SultanOrazbayev
SultanOrazbayev

Reputation: 16551

Given the config.yaml, the WILDCARD contains a list of string values:

WILDCARD = config['wildcard']
# WILDCARD = ['a', 'b', 'c']

By applying f-string formatting to rule all input, you are really requesting the following file:

rule all:
   input: f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
#        "~/temp/folder/{WILDCARD}/output.txt"

It's not clear if you are then requesting specific files from command line or if the code in the question is slightly inconsistent, but to fix your rule, consider defining a utility function that counts number of lines containing "something":

def _count_lines(f):
    path = f"../resources/{f}.csv"
    num_lines = sum(1 for line in open(path) if "something" in line)
    return num_lines

Then the rule combine_geno_pheno_data_sibs will look like:

rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        f"config['file']",
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        n_lines = lambda wildcards: _count_lines(wildcards.WILDCARD)
    script:
        "scripts/use_lines.R"

Upvotes: 0

dariober
dariober

Reputation: 9062

And I didn't really follow what your pipeline does but perhaps this is what you want?

configfile: config.yaml
WILDCARD = ['a', 'b', 'c']
TEMP_DIR = "~/temp"

rule all:
    input:
        expand("{TEMP_DIR}/folder/{WILDCARD}/output.txt", TEMP_DIR=TEMP_DIR, WILDCARD=WILDCARD),

rule combine_geno_pheno_data_sibs:
    input:
        somefile= "{TEMP_DIR}/file.txt",
        anotherfile= "{TEMP_DIR}/folder/{WILDCARD}/another_file.txt",
        csv= lambda wc:  '../resources/%s.csv' % wc.WILDCARD,
    output:
        "{TEMP_DIR}/folder/{WILDCARD}/output.txt"
    script:
        "scripts/use_lines.R"

I would get the n_lines counts inside the R script with standard R code. E.g.

csv <- read.table(snakemake@input$csv, ...)
n_lines <- nrow(csv[csv$somecol == 'something', ])

I don't have much sympathy for script directive because it can see everything in the snakemake workflow so it is not clear what it takes in input. I prefer to write a standalone R script that you execute via shell directive. But this is unrelated to your question...

(Note that in your rule all you miss the input directive but I guess this is a typo).

Upvotes: 1

Related Questions