Diesel
Diesel

Reputation: 43

Nextflow bioinformatics snp calling beginner question: pair of files output of one process as input of another process

I have just started learning nextflow. I have been following the rna seq tutorial provided here: https://training.nextflow.io/basic_training/rnaseq_pipeline/#quality-control but there are a couple of things that I cannot reimplement in my own workflow. I am slowly trying to build a SNP calling workflow from a pipeline I use often, just as a test for me to learn how to use nextflow.

This is my code, it uses two containers which are specified in the nexflow.config file.


params.reads = "/home/*_R{1,2}*.gz"
params.outdir = "./"
params.ref_genome = "/home/GCF_000001735.4_TAIR10.1_genomic.fasta"

log.info """\
    S N P   C A L L I N G   P I P E L I N E
    ===================================
    reference    : ${params.ref_genome}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent(true)


/*
 * Processes
 */
process trimSequences {
    tag "Fastp on $sample_id"
    publishDir params.outdir, mode: 'copy'
    cpus 4
    debug true

    input:
    tuple val(sample_id), path(reads)


    output:
    tuple val(sample_id), path("${sample_id}_{1,2}_trimmed.fastq.gz")

    script:
    """
    fastp -w $task.cpus -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_1_trimmed.fastq.gz -O ${sample_id}_2_trimmed.fastq.gz
    """

}



process bwaIndex {
    publishDir params.outdir, mode: 'copy'

    input:
    path reference

    output:
    file "${reference.baseName}.fasta.amb"
    file "${reference.baseName}.fasta.ann"
    file "${reference.baseName}.fasta.bwt"
    file "${reference.baseName}.fasta.pac"
    file "${reference.baseName}.fasta.sa"

    script:
    """
    bwa index -a bwtsw $reference
    """
}

process bwaMap {
    publishDir params.outdir, mode: 'copy'
    cpus 4
    input:
    path reference
    tuple val(trimmed_reads), path(trimmed_ch)

    output:
    file "final.sam"

    script:
    """
    bwa mem -t $task.cpus $reference ${trimmed_reads[0]} ${trimmed_reads[1]} > final.sam
   """
}


/*
 * Define the workflow
 */
workflow {

    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }

   // Process read pairs through trimSequences
    trimmed_ch = trimSequences(read_pairs_ch)
    bwaIndex(params.ref_genome)
    bwaMap(params.ref_genome, trimmed_ch)
}

So the first process "trimSequences" works. This basically takes a pair of fastq files for a sample in the directory /home/ which contains pairs of fastq files for several samples, and trim each pair giving the output the correct output name I want. So for each pair of input files, I produce a pair of output files. Then I copy them in the current directory.

Second process, the bwa index works (bwaIndex). But I think my code is not very elegant at all. At the end of this process, I also copy the output (the bwa index files produced for my reference, in my current working directory). The index files are automatically generated by the function bwa index -a bwtsw, by appending suffixes to the reference fasta name, and I need them in this format for next step in the same location as the reference fasta.

The main issue is the third process, the actual mapping process, bwaMap. First, it should only start after bwaIndex has finished. However I don't know how to enforce that as bwa mem wants as input just the fasta file of the reference (same as bwaIndex process) and then it searches the index files in the same directory based on the same name as the reference plus suffixes. This is also the reason why in bwaIndex at the end I copy the index files in the current directory together with the reference. So bwaMap cant find the reference index files.

Then the other issue is that bwaMap is not taking the output of trimSequences correctly. Either I mispecified in the workflow section, or the input and output section of the bwaMap process. I have tried several ways, but i can't find a solution. when I have the indexes in the current directory (manually copying them there) the bwaMap process fails as it is not taking correctly the trimmed pairs of fastq as input (indicating not reading correctly the trimmed reads pairs, output of trimSequences).

Then what I am trying to do is simply use bwa to map each pair of trimmed reads (trimmed r1 and trimmed r2 for each sample), output of trimSequences to the reference genome.

error:


Caused by:
 Process `bwaMap (1)` terminated with an error exit status (1)

Command executed:

 bwa mem -t 1 GCF_000001735.4_TAIR10.1_genomic.fasta R A > final.sam

Command exit status:
 1

Command output:
 (empty)

Command error:
 [E::bwa_idx_load_from_disk] fail to locate the index files

Upvotes: 0

Views: 82

Answers (1)

Diesel
Diesel

Reputation: 43

So the bwaIndex issue -- I wanted bwaMap to wait for bwaIndex to finish before starting, and that was easy to resolve, it wasn't my reason for posting but was indeed creating problems. I simply added the outputs of bwaIndex as input of bwaMap in the processes specifications. Then in the workflow section I simply saved the output of bwaIndex, and fed it as input of bwaMap.

The main issue and reason for posting was that I could not understand why the output of trimSequences was not read properly by bwaMap. I knew that it was not read properly because everytime it threw an error the bwa mem command run was reported, and I could see that the trimmed reads name were not read properly in the command reported in the error. Such as in the error that I reported in my question, I had "R" and "A" rather than the trimmed reads full names/paths. The issue was in the way I was specifying the input in bwaMap process. I fixed it as shown in the code below. Now it all works, but probably in a not very elegant way. Any suggestion on how to improve this syntax is still welcome. I am still finding my way around this and not sure I fully understand when to use path, file and the various input types, but will get there. My solution:

#!/usr/bin/env nextflow

params.reads = "/home/*_R{1,2}*.gz"
params.outdir = "./"
params.ref_genome = "/home/GCF_000001735.4_TAIR10.1_genomic.fasta"

log.info """\
    S N P   C A L L I N G   P I P E L I N E
    ===================================
    reference    : ${params.ref_genome}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent(true)


/*
 * Processes
 */
process trimSequences {
    tag "Fastp on $sample_id"
    cpus 4


    input:
    tuple val(sample_id), path(reads)


    output:
    tuple val(sample_id), path("${sample_id}_{1,2}_trimmed.fastq.gz")

    script:
    """
    fastp -w $task.cpus -i ${reads[0]} -I ${reads[1]} -o ${sample_id}_1_trimmed.fastq.gz -O ${sample_id}_2_trimmed.fastq.gz
    """

}



process bwaIndex {

    input:
    path reference

    output:
    file "${reference.baseName}.fasta.amb"
    file "${reference.baseName}.fasta.ann"
    file "${reference.baseName}.fasta.bwt"
    file "${reference.baseName}.fasta.pac"
    file "${reference.baseName}.fasta.sa"

    script:
    """
    bwa index -a bwtsw $reference
    """
}

process bwaMap {

    publishDir params.outdir, mode: 'copy'
    cpus 4

    input:
    path reference
    file "${reference.baseName}.fasta.amb"
    file "${reference.baseName}.fasta.ann"
    file "${reference.baseName}.fasta.bwt"
    file "${reference.baseName}.fasta.pac"
    file "${reference.baseName}.fasta.sa"
    tuple val(sample_id), path(trimmed_reads)

    output:
    file "final.sam"

    script:
    """
    bwa mem -t $task.cpus $reference ${trimmed_reads[0]} ${trimmed_reads[1]} > final.sam
    """
}



/*
 * Define the workflow
 */
workflow {

    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }


    trimmed_ch = trimSequences(read_pairs_ch)
    bwa_index = bwaIndex(params.ref_genome)
    bwaMap(params.ref_genome, bwa_index, trimmed_ch)
}


Upvotes: 0

Related Questions