Reputation: 11
Subject: Looking for a good output format to use a value extracted from a file in new script/process in Nextflow
I can't seem to figure this one out:
I am writing some processes in Nextflow in which I extract a value from a .txt file (PROCESS1) and want to use it in a second process (PROCESS2). Extracting the value is no problem, but finding a suitable output format is. When I save the stdout (OPTION1) to a channel, there seems to be a trailing newline ("\n") attached, which causes problems in my second script.
Because that was not working, I tried saving the output of PROCESS1 as a file instead (OPTION2). That part works too, but I can't find the correct way to read the content of the file in PROCESS2. I suspect it has something to do with "getText()", but I tried several things and they all failed.
Finally I wanted to try to save the output as a variable (OPTION3) but I don't know how to do this.
PROCESS1
process txid {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
file(report) from report4txid
output:
stdout into txid4assembly //OPTION 1
file("txid.txt") into txid4assembly //OPTION 2
val(txid) into txid4assembly //OPTION 3: doesn't work
shell:
'''
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 //OPTION1
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid.txt //OPTION2
column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid //OPTION3
'''
}
PROCESS2
process accessions {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
val(txid) from txid4assembly //OPTION1 & OPTION3
file(txid) from txid4assembly //OPTION2
output:
file("${txid}accessions.txt") into accessionlist
script:
"""
esearch -db assembly -query '${txid}[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' \
| esummary | xtract -pattern DocumentSummary -element AssemblyAccession > ${txid}accessions.txt
"""
}
RESULTING SCRIPT OF PROCESS2 AFTER OPTION 1 (remark: output = 573, layout unchanged)
esearch -db assembly -query '573
[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > 573
accessions.txt
Thank you for your help!
Upvotes: 1
Views: 890
Reputation: 11
I eventually fixed it by appending the following to the pipeline, which keeps only the digits of my output:
... | tr -dc '0-9'
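To make the effect of that fix visible, here is a small sketch that can be run in any POSIX shell. The values 573 and GCF_573.1 are only stand-ins for whatever the real pipeline prints. It counts the bytes before and after filtering, and also shows `tr -d '\n'`, which removes only newlines and may be the safer choice if the value could ever contain non-digit characters.

```shell
# Stand-in for the output of the cut pipeline: "573" plus a trailing newline.
with_newline=$(printf '573\n' | wc -c | tr -d ' ')

# tr -dc '0-9' deletes every character that is NOT a digit, newline included.
digits_only=$(printf '573\n' | tr -dc '0-9' | wc -c | tr -d ' ')

# tr -d '\n' deletes only newlines, so letters, dots and underscores survive.
# GCF_573.1 is a made-up value used purely for illustration.
no_newline=$(printf 'GCF_573.1\n' | tr -d '\n')

echo "bytes before: $with_newline, bytes after: $digits_only"
echo "newline-only removal keeps: $no_newline"
```

Note that `tr -dc '0-9'` would also silently join digits from several lines into one number, so it is only safe when the pipeline is known to emit a single value.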
Upvotes: 0
Reputation: 54502
As you've discovered, your command line writes a trailing newline character. You could remove it by piping to another command, or (better) by refactoring to parse your report files properly. Below is an example using awk to print the fifth column without a trailing newline character. This might work fine for a simple CSV report file, but the CSV parsing capabilities of AWK are limited. So if your reports could contain quoted fields etc., consider using a language that offers CSV parsing in its standard library or in a widely used module (e.g. Python and the csv library, or Perl and the Text::CSV module). Nextflow makes it easy to use your favourite scripting language.
process txid {
publishDir "$wanteddir", mode:'copy', overwrite: true
input:
file(report) from report4txid
output:
stdout into txid4assembly
shell:
'''
awk -F, '$4 == "S" { printf("%s", $5); exit }' "!{report}"
'''
}
In the case where your file contains an "S" in the fourth column and the fifth column has some value with string length >= 1, this will give you a value that you can use in your 'accessions' process. But please be aware that this won't handle the case where the fourth column in your file is never equal to "S". Nor will it handle the case where the fifth column is empty (string length == 0). In these cases 'stdout' will be empty, so you'll get an empty value in your output channel. You may want to add some code to make sure that these edge cases are handled somehow.
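One possible way to handle those edge cases is to have awk exit with a non-zero status when no usable value is found. The sketch below is only an illustration: the function name `extract_txid` and the sample report contents are hypothetical, and it assumes the same plain comma-separated layout as the awk example above. Inside a Nextflow process, a non-zero exit status would normally make the task fail instead of passing an empty value downstream.

```shell
# Hypothetical sample report: column 4 holds the code, column 5 the value.
report=$(mktemp)
printf '%s\n' '10.5,100,100,G,570,Klebsiella' \
              '9.8,90,90,S,573,Klebsiella pneumoniae' > "$report"

extract_txid() {
    # Print column 5 of the first row whose column 4 is "S", without a
    # trailing newline. Exit 1 if there is no such row or the value is empty.
    awk -F, '
        $4 == "S" { if ($5 != "") { printf("%s", $5); found = 1 }; exit }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

if txid=$(extract_txid "$report"); then
    echo "txid=$txid"
else
    echo "no usable value found in $report" >&2
fi
```

The `exit` in the main rule stops reading after the first matching row and hands control to the `END` block, which sets the final exit status.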
Upvotes: 0