Nextflow bioinformatics snp calling beginner question: pair of files output of one process as input of another process

Question

I have just started learning nextflow. I have been following the rna seq tutorial provided here: https://training.nextflow.io/basic_training/rnaseq_pipeline/#quality-control but there are a couple of things that I cannot reimplement in my own workflow. I am slowly trying to build a SNP calling workflow from a pipeline I use often, just as a test for me to learn how to use nextflow.

This is my code, it uses two containers which are specified in the nexflow.config file.


params.reads = "/home/*_R{1,2}*.gz"
params.outdir = "./"
params.ref_genome = "/home/GCF_000001735.4_TAIR10.1_genomic.fasta"

log.info """\
    S N P   C A L L I N G   P I P E L I N E
    ===================================
    reference    : ${params.ref_genome}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent(true)


/*
 * Processes
 */
process trimSequences {
    tag "Fastp on $sample_id"
    publishDir params.outdir, mode: 'copy'
    cpus 4
    debug true

    input:
    tuple val(sample_id), path(reads)


    output:
    tuple val(sample_id), path("${sample_id}_{1,2}_trimmed.fastq.gz")

    script:
    """
    fastp -w $task.cpus -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_1_trimmed.fastq.gz -O ${sample_id}_2_trimmed.fastq.gz
    """

}



process bwaIndex {
    publishDir params.outdir, mode: 'copy'

    input:
    path reference

    output:
    file "${reference.baseName}.fasta.amb"
    file "${reference.baseName}.fasta.ann"
    file "${reference.baseName}.fasta.bwt"
    file "${reference.baseName}.fasta.pac"
    file "${reference.baseName}.fasta.sa"

    script:
    """
    bwa index -a bwtsw $reference
    """
}

process bwaMap {
    publishDir params.outdir, mode: 'copy'
    cpus 4
    input:
    path reference
    tuple val(trimmed_reads), path(trimmed_ch)

    output:
    file "final.sam"

    script:
    """
    bwa mem -t $task.cpus $reference ${trimmed_reads[0]} ${trimmed_reads[1]} > final.sam
   """
}


/*
 * Define the workflow
 */
workflow {

    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }

   // Process read pairs through trimSequences
    trimmed_ch = trimSequences(read_pairs_ch)
    bwaIndex(params.ref_genome)
    bwaMap(params.ref_genome, trimmed_ch)
}

So the first process "trimSequences" works. This basically takes a pair of fastq files for a sample in the directory /home/ which contains pairs of fastq files for several samples, and trim each pair giving the output the correct output name I want. So for each pair of input files, I produce a pair of output files. Then I copy them in the current directory.

Second process, the bwa index works (bwaIndex). But I think my code is not very elegant at all. At the end of this process, I also copy the output (the bwa index files produced for my reference, in my current working directory). The index files are automatically generated by the function bwa index -a bwtsw, by appending suffixes to the reference fasta name, and I need them in this format for next step in the same location as the reference fasta.

The main issue is the third process, the actual mapping process, bwaMap. First, it should only start after bwaIndex has finished. However I don't know how to enforce that as bwa mem wants as input just the fasta file of the reference (same as bwaIndex process) and then it searches the index files in the same directory based on the same name as the reference plus suffixes. This is also the reason why in bwaIndex at the end I copy the index files in the current directory together with the reference. So bwaMap cant find the reference index files.

Then the other issue is that bwaMap is not taking the output of trimSequences correctly. Either I mispecified in the workflow section, or the input and output section of the bwaMap process. I have tried several ways, but i can't find a solution. when I have the indexes in the current directory (manually copying them there) the bwaMap process fails as it is not taking correctly the trimmed pairs of fastq as input (indicating not reading correctly the trimmed reads pairs, output of trimSequences).

Then what I am trying to do is simply use bwa to map each pair of trimmed reads (trimmed r1 and trimmed r2 for each sample), output of trimSequences to the reference genome.

error:


Caused by:
 Process `bwaMap (1)` terminated with an error exit status (1)

Command executed:

 bwa mem -t 1 GCF_000001735.4_TAIR10.1_genomic.fasta R A > final.sam

Command exit status:
 1

Command output:
 (empty)

Command error:
 [E::bwa_idx_load_from_disk] fail to locate the index files

Nextflow bioinformatics snp calling beginner question: pair of files output of one process as input of another process

Answers (1)

Related Questions