nextflow: how to pass directory path with files created in process before

Question

I have three files: main.nf, index_process.nf and dummy_bwamem.nf

main.nf sets the output directory where the index files should be created. The index is built with the bwa aligner. I'm using a small FASTA (the Neisseria meningitidis genome) to work through and learn.

My config.nf doesn't generate files as intended, so I have left it out.

Below are the contents of the scripts.

main.nf

params.outdir_index_temp="./bwa_index_temp"
params.outdir_bwa_mem="./bwamem"
params.hg38genome ="/Users/username/Downloads/NM.fasta"

include {bwa_index} from './index_process.nf'
include { align_bwa_mem} from './dummy_bwamem.nf'

workflow {

    bwa_index(params.hg38genome, params.outdir_index_temp)
    align_bwa_mem(bwa_index.out, params.outdir_index_temp).view()

}

index_process.nf

process bwa_index {

    tag {ref_fasta.name}
    publishDir "${outdir}/", mode:"copy"

    input:
    path ref_fasta
    val outdir 

    output:
    tuple val(ref_fasta.name), path("${ref_fasta.name}.{ann,amb,sa,bwt,pac}")

    """
    bwa index "${ref_fasta}"
    """
}

dummy_bwamem.nf

process align_bwa_mem {

    input:
    tuple val(fasta_name), path(indexfiles) // if you use val(pathindexfiles) you get work/ae/34567890/asdasdad and such
    val(path_index_output)

    output:
    stdout

    script:

    """
    echo "$path_index_output $indexfiles
"
    echo "$fasta_name
"
    """
}

Output:

$ nextflow run main.nf
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [goofy_cuvier] DSL2 - revision: 03a15979cf
executor >  local (2)
[84/73be26] process > bwa_index (NM.fasta) [100%] 1 of 1 ✔
[a1/75f4d0] process > align_bwa_mem        [100%] 1 of 1 ✔
./bwa_index_temp NM.fasta.amb NM.fasta.ann NM.fasta.bwt NM.fasta.pac NM.fasta.sa

NM.fasta

How do I make sure that the alignment step reads the index files from the output directory, i.e. as ./bwa_index_temp/NM.fasta.amb ./bwa_index_temp/NM.fasta.ann ./bwa_index_temp/NM.fasta.bwt ./bwa_index_temp/NM.fasta.pac ./bwa_index_temp/NM.fasta.sa?

I built this Nextflow code on top of what I have already posted on Stack Overflow. However, I'm being stubborn about writing the code my own way: there are things I'd like to learn rather than be spoon-fed.

Steve · Accepted Answer

There are several issues here. Note that:

The params implicit variable is globally scoped. This means that your bwa_index process can already access params.outdir_index_temp directly, for example:

process bwa_index {

    publishDir params.outdir_index_temp, mode:"copy"

    ...
}

The parameters (e.g. params.hg38genome) defined above your workflow block are just regular strings (i.e. java.lang.String). If a process requires a file input, we should provide a channel that either emits one or more file objects (i.e. a queue channel) or is bound to a single value (i.e. a value channel). Since we will later want to read the bwa_index process outputs an unlimited number of times, we want a value channel here:

workflow {

    hg38genome = file( params.hg38genome )
    
    bwa_index( hg38genome )

    ...

This part of the documentation is key here:

A value channel is implicitly created by a process when it is invoked with a simple value. Furthermore, a value channel is also implicitly created as output for a process whose inputs are all value channels.

The reason it is necessary to pass file objects to our processes when they're required is to ensure that our input files are properly staged into the process working directory when the task is run. This makes sure our workflow is portable and can be run in the cloud, where processes are often run completely isolated from each other (i.e. there is no shared filesystem). In the same way, we should also avoid writing to files outside of the process working directory.
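
The difference between the two channel types can be seen in a small sketch. The process name show_genome and the labels channel are hypothetical and exist only for illustration; the FASTA path is the one from the question. This assumes DSL2 on a Nextflow version that supports the debug directive (such as the 23.04.1 used here):

```nextflow
params.hg38genome = '/Users/username/Downloads/NM.fasta'

// Hypothetical process, used only to show how often a task is launched
process show_genome {

    debug true

    input:
    val label
    path ref_fasta

    script:
    """
    echo "${label}: ${ref_fasta}"
    """
}

workflow {

    // Queue channel with three items
    labels = Channel.of( 'a', 'b', 'c' )

    // file() returns a single file object; passing it to a process
    // implicitly creates a value channel, which can be read repeatedly
    genome = file( params.hg38genome )

    // Launches three tasks, one per label, each with the FASTA staged in
    show_genome( labels, genome )

    // By contrast, Channel.fromPath( params.hg38genome ) is a queue channel
    // that emits the file once, so the same call would launch a single task.
    // Appending .first() would turn it back into a value channel.
}
```

This is why the outputs of bwa_index, whose only input is a value channel, can later be paired with every item of a reads channel.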

We should never attempt to read from or write to any directory specified using the publishDir directive:

Files are copied into the specified directory in an asynchronous manner, so they may not be immediately available in the published directory at the end of the process execution. For this reason, downstream processes should not try to access output files through the publish directory, but through channels.

If your upstream process produces a directory that is required downstream, you can have the upstream process output the directory itself. However, if you need to access the individual files in this directory, use a glob pattern to declare the output files in it. If your process doesn't already create an output directory, a better approach is to use a glob pattern to declare the files in the top-level directory (i.e. the process working directory). Then, if you would like a downstream process to have the files staged into a subdirectory, you can specify a name pattern in the path input so that the files are staged under their source file names inside that subdirectory. For example, here the individual index files are staged under a directory called bwa_index in each of the process working directories:

params.reads = '/Users/name/Downloads/tiny/normal/*_R{1,2}_xxx.fastq.gz'
params.hg38genome = '/Users/name/Downloads/NM.fasta'

params.outdir_index_temp = "./bwa_index_temp"
params.outdir_bwa_mem = "./bwamem"

include { bwa_index } from './index_process.nf'
include { align_bwa_mem } from './dummy_bwamem.nf'


workflow {

    reads = Channel.fromFilePairs( params.reads )

    hg38genome = file( params.hg38genome )

    bwa_index( hg38genome )

    align_bwa_mem( reads, bwa_index.out )
}

Contents of index_process.nf:

process bwa_index {

    tag { ref_fasta.name }

    publishDir params.outdir_index_temp, mode: "copy"

    input:
    path ref_fasta

    output:
    tuple val(ref_fasta.name), path("*.{ann,amb,sa,bwt,pac}")

    """
    bwa index "${ref_fasta}"
    """
}

Contents of dummy_bwamem.nf:

process align_bwa_mem {

    tag { sample }

    publishDir params.outdir_bwa_mem, mode: "copy"

    debug true

    input:
    tuple val(sample), path(reads)
    tuple val(idxbase), path("bwa_index/*")

    output:
    tuple val(sample), path("${sample}.bam")

    """
    echo "${sample}"
    echo bwa mem \
        "bwa_index/${idxbase}" \
        ${reads}

    touch "${sample}.bam"
    """
}

Results:

$ nextflow run main.nf
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [modest_venter] DSL2 - revision: ff53a69e4e
executor >  local (4)
[62/bb6cb8] process > bwa_index (NM.fasta) [100%] 1 of 1 ✔
[dc/2a4bfd] process > align_bwa_mem (foo)  [100%] 3 of 3 ✔
bar
bwa mem bwa_index/NM.fasta bar_R1_xxx.fastq.gz bar_R2_xxx.fastq.gz

baz
bwa mem bwa_index/NM.fasta baz_R1_xxx.fastq.gz baz_R2_xxx.fastq.gz

foo
bwa mem bwa_index/NM.fasta foo_R1_xxx.fastq.gz foo_R2_xxx.fastq.gz
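
The answer also mentions that an upstream process can output a directory itself. A minimal sketch of that variant follows. The process names bwa_index_dir and show_index are hypothetical, and the sketch assumes that bwa's -p option (index prefix) is used to write the index into a subdirectory of the task working directory:

```nextflow
params.hg38genome = '/Users/name/Downloads/NM.fasta'

// Hypothetical variant of bwa_index that emits a whole directory
process bwa_index_dir {

    tag { ref_fasta.name }

    input:
    path ref_fasta

    output:
    tuple val(ref_fasta.name), path("bwa_index")

    script:
    """
    mkdir bwa_index
    bwa index -p "bwa_index/${ref_fasta.name}" "${ref_fasta}"
    """
}

// Hypothetical consumer: the directory is staged into the task working directory
process show_index {

    debug true

    input:
    tuple val(idxbase), path(index_dir)

    script:
    """
    ls "${index_dir}"
    echo "index prefix: ${index_dir}/${idxbase}"
    """
}

workflow {

    bwa_index_dir( file( params.hg38genome ) )

    show_index( bwa_index_dir.out )
}
```

Either way, the downstream process receives the index through a channel and never reads from the publishDir location.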
