Nextflow: publishDir, output channels, and output subdirectories

Question

I've been learning Nextflow and have run into an issue with adding output to a channel, which I need because my processes have to run in order. I want to pass the output files from one of the output subdirectories created by the tool (ONT Guppy) into a channel, but I can't figure out how.

Here is the Nextflow process in question:

process GupcallBases {
    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
    
    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
     
    output:
    path "*.bam" into bams_ch
            
    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
            guppy_basecaller --config $params.P1_gupConf \
            --device "cuda:0" \
            --bam_out \
            --recursive \
            --compress \
            --align_ref $params.refGen \
            -i $params.P1_inDir \
            -s $params.P1_outDir \
            --gpu_runners_per_device $params.P1_GPU_runners \
            --num_callers $params.P1_callers
    """
}

The output of the process is something like this:

$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)

I only want to keep the BAM files in $params.P1_outDir/pass/, hence the pattern: "pass/*.bam" argument, but I've tried a few other patterns to no avail.

The output syntax was chosen since once this process is done, using the following channel works:

//    Channel
//      .fromPath("${params.P1_outDir}/pass/*.bam")
//      .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
//      .set { bams_ch }

But the problem is that if I don't pass the files through the output channel of the first process, the two processes run in parallel. I may simply be missing something in the extensive documentation about how to order processes, which would be an alternative solution.

Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.

Thanks in advance.

Steve · Accepted Answer

Nextflow processes are designed to run in isolation from each other, but this can be circumvented somewhat when the command-line inputs and/or outputs are specified using params. Using params like this is problematic: if, for example, a params variable specifies an absolute path but your output declaration expects files in the process working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing. In your case, -s $params.P1_outDir makes Guppy write outside the working directory, so path "*.bam" matches nothing there.

The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
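On the ordering question: once the BAM files are declared as process outputs, any process that reads from bams_ch cannot start until the basecalling task has emitted them, so no extra ordering logic should be needed. A minimal sketch of such a consumer, in the same DSL1 syntax as the question, might look like this. The IndexBam process, the bai_ch channel, and the use of samtools are hypothetical and only for illustration:

```groovy
// Hypothetical downstream process (not part of the original pipeline).
// It starts only once the upstream process has emitted files into bams_ch.
process IndexBam {

    input:
    // A glob output emits all matched files as one list;
    // flatten() turns that list into one task per BAM file.
    path bam from bams_ch.flatten()

    output:
    // samtools index writes the index next to the staged BAM
    path "${bam}.bai" into bai_ch

    """
    samtools index "${bam}"
    """
}
```

Because the dependency is expressed through the channel, the Channel.fromPath workaround that globs the publish directory is no longer required.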

Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:

params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {

    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )
    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    """
    mkdir outdir
    guppy_basecaller \
        --config "${guppy_config}" \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref "${ref_genome}" \
        -i "${input_dir}" \
        -s outdir \
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \
        --num_callers "${params.guppy_callers}"
    """
}
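For completeness, the nextflow.config side of this could be sketched as follows. This is an assumption-laden example rather than a tested configuration: the executor and cluster options are carried over from the question, and the P1_* params are assumed to be defined elsewhere (in the config or on the command line):

```groovy
// nextflow.config (sketch; adjust to your cluster)

// Let Nextflow wrap each task in `singularity exec` itself,
// using the image named by the process-level `container` directive.
singularity {
    enabled = true
}

process {
    executor = 'pbspro'

    // Resource request taken from the question; assumes the
    // P1_* params are defined before this point.
    withName: 'GuppyBasecaller' {
        clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
    }
}
```

Keeping the executor and scheduler options in the config rather than in the process body leaves the process definition portable across systems.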
