Haja

Reputation: 11

Looking for a good output format to use a value extracted from a file in new script/process in Nextflow


I can't seem to figure this one out:

I am writing some Nextflow processes in which I extract a value from a .txt file (PROCESS1) and want to use it in a second process (PROCESS2). Extracting the value is no problem, but finding a suitable output format is. When I send the stdout (OPTION 1) to a channel, a trailing newline ("\n") seems to be attached, which causes problems in my second script.

Alternatively, because this was not working, I tried saving the output of PROCESS1 as a file (OPTION 2). That works too, but I can't find the correct way to read the content of the file in PROCESS2. I suspect it has something to do with "getText()", but I tried several things and they all failed.

Finally, I wanted to try saving the output as a variable (OPTION 3), but I don't know how to do this.

PROCESS1

process txid {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    file(report) from report4txid

    output:
    stdout into txid4assembly           //OPTION 1
    file(txid.txt) into txid4assembly   //OPTION 2
    val(txid) into txid4assembly        //OPTION 3: doesn't work


    shell:
    '''
    column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5            //OPTION1
    column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid.txt //OPTION2
    column -s, -t < !{report}| awk '$4 == "S"'| head -n 1 | cut -f5 > txid     //OPTION3

    '''
}

PROCESS2

process accessions {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    val(txid) from txid4assembly       //OPTION1 & OPTION3
    file(txid) from txid4assembly      //OPTION2

    output:
    file("${txid}accessions.txt") into accessionlist

    script:
    """
    esearch -db assembly -query '${txid}[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' \
    | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > ${txid}accessions.txt
    """
}

RESULTING SCRIPT OF PROCESS2 AFTER OPTION 1 (remark: output = 573, layout unchanged)

esearch -db assembly -query '573
  [txid] AND "complete genome"[filter] AND "latest refseq"[filter]'     | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > 573
  accessions.txt

Thank you for your help!

Upvotes: 1

Views: 890

Answers (2)

Haja

Reputation: 11

I eventually fixed it by appending the following to the pipeline, which keeps only the digits from my output:

... | tr -dc '0-9'
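
To see what this filter does, here is a minimal sketch. The `emit_txid` function is a hypothetical stand-in for the extraction pipeline (it simply prints 573 followed by a newline); byte counts are used because the newline is otherwise invisible.

```shell
# Hypothetical stand-in for the extraction pipeline: prints the value plus a newline.
emit_txid() { printf '573\n'; }

# Count bytes before and after filtering; tr -d ' ' strips padding some wc versions add.
before=$(emit_txid | wc -c | tr -d ' ')
after=$(emit_txid | tr -dc '0-9' | wc -c | tr -d ' ')

# tr -dc '0-9' deletes every character that is NOT a digit, including the newline.
cleaned=$(emit_txid | tr -dc '0-9')

echo "bytes before: $before, bytes after: $after, value: $cleaned"
```

Note that `tr -dc '0-9'` also drops any other non-digit character, so it only suits purely numeric values. If only the newline should go, `tr -d '\n'` is a narrower alternative.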

Upvotes: 0

Steve

Reputation: 54502

As you've discovered, your command line writes a trailing newline character. You could remove it, for example by piping to another command, or (better) refactor to parse your report files properly. Below is an example using AWK to print the fifth column without a trailing newline character. This might work fine for a simple CSV report file, but the CSV parsing capabilities of AWK are limited. So if your reports could contain quoted fields etc., consider using a language with good CSV parsing support (e.g. Python and the csv library from its standard library, or Perl and the Text::CSV module). Nextflow makes it easy to use your favourite scripting language.

process txid {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    file(report) from report4txid

    output:
    stdout into txid4assembly

    shell:
    '''
    awk -F, '$4 == "S" { printf("%s", $5); exit }' "!{report}"
    '''
}

In the case where your file contains an "S" in the fourth column and the fifth column has some value with string length >= 1, this will give you a value that you can use in your 'accessions' process. But please be aware that this won't handle the case where the fourth column in your file is never equal to "S". Nor will it handle the case where the fifth column is empty (string length == 0). In these cases 'stdout' will be empty, so you'll get an empty value in your output channel. You may want to add some code to make sure that these edge cases are handled somehow.
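
One possible way to cover those edge cases, sketched under assumptions: the `extract_txid` function name and the sample report contents below are hypothetical, and the column layout is taken from the question. The idea is to have AWK exit with a non-zero status when no row matches or the matched value is empty, so the caller can tell "no value" apart from a real one.

```shell
# Hypothetical sample reports; real report files may look different.
report=$(mktemp)
missing=$(mktemp)
printf '%s\n' 'a,b,c,G,1' 'a,b,c,S,573' 'a,b,c,S,999' > "$report"
printf '%s\n' 'a,b,c,G,1' > "$missing"   # no row with "S" in column 4

extract_txid() {
    awk -F, '
        # Remember the first matching row, then jump to the END block.
        $4 == "S" { found = 1; value = $5; exit }
        END {
            # Fail when nothing matched or the fifth column was empty.
            if (!found || value == "") exit 1
            printf("%s", value)   # no trailing newline
        }
    ' "$1"
}

if txid=$(extract_txid "$report"); then
    echo "txid: $txid"
else
    echo "no usable txid found" >&2
fi

if extract_txid "$missing" > /dev/null; then
    missing_status=ok
else
    missing_status=failed
fi
echo "report without an S row: $missing_status"

rm -f "$report" "$missing"
```

Inside a Nextflow process, a non-zero exit status from the script would normally make the task fail, which is usually preferable to silently passing an empty value downstream.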

Upvotes: 0
