Reputation: 700
I have an R script in my workflow that requires the number of entries on several csv files (a.csv
, b.csv
, c.csv
; formatted with headers) as a value. Since they all have a string something
in every line, I thought I could write the rule as follows:
configfile: config.yaml
WILDCARD = config['wildcard']
TEMP_DIR = "~/temp"
rule all:
input:
f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
rule combine_geno_pheno_data_sibs:
input:
f"{TEMP_DIR}/file.txt",
f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
f"config['file']",
output:
f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
params:
n_lines = shell(
"grep -c something ../resources/{{WILDCARD}}.csv | xargs"
)
script:
"scripts/use_lines.R"
config.yaml
contains
wildcard:
- a
- b
- c
and n_lines
is called in R as snakemake@params$n_lines
.
The way the expansion is interpreted in shell()
, though, is as grep -c something ../resources/a b c.csv
, how do I get it to interpret the wildcards as e.g. grep -c something ../resources/a.csv
and return the value to n_lines
correctly?
Thanks in advance
Upvotes: 1
Views: 108
Reputation: 16551
Given the config.yaml
, the WILDCARD
contains a list of string values:
WILDCARD = config['wildcard']
# WILDCARD = ['a', 'b', 'c']
By applying f-string formatting to rule all
input, you are really requesting the following file:
rule all:
input: f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
# "~/temp/folder/{WILDCARD}/output.txt"
It's not clear if you are then requesting specific files from command line or if the code in the question is slightly inconsistent, but to fix your rule, consider defining a utility function that counts number of lines containing "something":
def _count_lines(f):
path = f"../resources/{f}.csv"
num_lines = sum(1 for line in open(path) if "something" in line)
return num_lines
Then the rule combine_geno_pheno_data_sibs
will look like:
rule combine_geno_pheno_data_sibs:
input:
f"{TEMP_DIR}/file.txt",
f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
f"config['file']",
output:
f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
params:
n_lines = lambda wildcards: _count_lines(wildcards.WILDCARD)
script:
"scripts/use_lines.R"
Upvotes: 0
Reputation: 9062
And I didn't really follow what your pipeline does but perhaps this is what you want?
configfile: config.yaml
WILDCARD = ['a', 'b', 'c']
TEMP_DIR = "~/temp"
rule all:
input:
expand("{TEMP_DIR}/folder/{WILDCARD}/output.txt", TEMP_DIR=TEMP_DIR, WILDCARD=WILDCARD),
rule combine_geno_pheno_data_sibs:
input:
somefile= "{TEMP_DIR}/file.txt",
anotherfile= "{TEMP_DIR}/folder/{WILDCARD}/another_file.txt",
csv= lambda wc: '../resources/%s.csv' % wc.WILDCARD,
output:
"{TEMP_DIR}/folder/{WILDCARD}/output.txt"
script:
"scripts/use_lines.R"
I would get the n_lines
counts inside the R script with standard R code. E.g.
csv <- read.table(snakemake@input$csv, ...)
n_lines <- nrow(csv[csv$somecol == 'something', ])
I don't have much sympathy for script
directive because it can see everything in the snakemake workflow so it is not clear what it takes in input. I prefer to write a standalone R script that you execute via shell
directive. But this is unrelated to your question...
(Note that in your rule all
you miss the input
directive but I guess this is a typo).
Upvotes: 1