Reputation: 25

Nextflow script for performing quality control and trimming the sequence

I am new to Nextflow scripts. I am trying to build a mitochondrial DNA variant pipeline. I have used the FastQC and Trimmomatic tools for quality checking and for trimming low-quality sequences. I have written the script below; the program executes but shows no output.

#!/usr/bin/env nextflow

params {
  fastq_dir = "/mnt/e/Bioinformatics_ppt_learning/mtDNA/nextflow_scripts/*.fastq.gz"
  fastqc_dir = "/mnt/e/Bioinformatics_ppt_learning/mtDNA/nextflow_scripts/fastqc_report"
  trimmed_dir = "/mnt/e/Bioinformatics_ppt_learning/mtDNA/nextflow_scripts/trimmed_fastq"
  trimmomatic_jar = "/mnt/e/Bioinformatics_ppt_learning/mtDNA/nextflow_scripts/trimmomatic-0.39.jar"
}

process FastQC {
  tag "Running FastQC on ${fastq}"
  publishDir "${fastqc_dir}/${fastq.baseName}"
  input: path fastq
  script:
    """
    fastqc -o ${fastqc_dir} ${fastq}
    """
}

process Trimmomatic {
  tag "Trimming ${fastq.baseName}"
  input:
    path read1 from FastQC.output

  output:
    file(joinPath(trimmed_dir, "${read1.baseName}_trimmed.fastq.gz"))

  script:
    """
    java -jar ${params.trimmomatic_jar} PE -threads 4 \
      ${read1} ${joinPath(trimmed_dir, "${read1.baseName}_trimmed.fastq.gz")} \
      ${joinPath(trimmed_dir, "${read1.baseName}_unpaired.fastq.gz")} \
      ${joinPath(trimmed_dir, "${read1.baseName}_unpaired.fastq.gz")}
    """
}

workflow {
  fastq_files = Channel.fromPath(params.fastq_dir)

  fastq_files.each {
    FastQC(fastq: it)
    Trimmomatic(read1: FastQC.output)
  }
}

Upvotes: 0

Answers (2)

Valentin Ruano

Reputation: 2809

There are a few issues with your code. I will address what I see here, but this most likely won't fix everything; it should just get you closer to a first working version.

The most prominent issue with your code is the 'each' loop in the main workflow. It reflects a common misconception among Nextflow beginners, who tend to see processes as if they were Java/Groovy methods. They are rather singleton objects that are connected through data channels. The main purpose of the workflow code is to declare the channel connections between the processes.

So instead of

workflow {
  fastq_files = Channel.fromPath(params.fastq_dir)

  fastq_files.each {
    FastQC(fastq: it)
    Trimmomatic(read1: FastQC.output)
  }
}

you ought to write something like this:

workflow {
  fastq_files = Channel.fromPath(params.fastq_dir)
  FastQC(fastq_files)
  Trimmomatic(FastQC.out)
}

Or, since your processes have simple outputs that feed into each other, you can do this:

workflow {
  Channel.fromPath(params.fastq_dir) | FastQC | Trimmomatic
}

However, in practice, realistic pipelines get complicated quickly (several outputs per process), and you may need to revert to the longer, non-piping form above.
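As a rough illustration of that longer form, here is a minimal sketch of a process with two named outputs. The process name TrimWithReport, the emit names trimmed and report, and the cp/echo commands standing in for a real tool are all hypothetical; params.fastq_dir is the parameter from the question and is assumed to be set in nextflow.config or on the command line.

```nextflow
// Hypothetical process with two outputs. The shell commands are stand-ins
// for a real tool so that the sketch stays self-contained.
process TrimWithReport {
  input:
    path fastq

  output:
    path "*_trimmed.fastq.gz", emit: trimmed
    path "*.report.txt",       emit: report

  script:
    """
    cp ${fastq} ${fastq.simpleName}_trimmed.fastq.gz
    echo "processed ${fastq}" > ${fastq.simpleName}.report.txt
    """
}

workflow {
  fastq_files = Channel.fromPath(params.fastq_dir)
  TrimWithReport(fastq_files)

  // With several outputs, select the channel you need by its emit name
  TrimWithReport.out.trimmed.view { "trimmed: ${it}" }
  TrimWithReport.out.report.view { "report: ${it}" }
}
```

With more than one output channel, the pipe form no longer says which output should feed the next process, which is why the explicit .out.<name> form tends to be needed.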

Since the workflow code above already does this, you don't need to link the input of Trimmomatic to the output of FastQC explicitly in the definition of the input read1 (the "from FastQC.output" part); that is old-style (DSL1) Nextflow and makes it harder to reuse processes across pipelines.
Your processes do not define outputs, so it comes as no surprise that nothing gets published; frankly, it should not work at all. So please add the corresponding output: sections as described in the Nextflow documentation.
At least in FastQC, you try to specify the output location twice: once with the publishDir directive (the correct way) and again in the process script itself, using absolute paths for the actual output files/directories (wrong). Fix: (a) keep the publishDir in FastQC and add one for Trimmomatic; (b) have each script write its output using relative paths in the process working directory; (c) change or add an output: section in each process naming the output files, so that they get published according to the publishDir directive.
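The fix described above (one publishDir, relative paths in the script, an output: section) could look roughly like this for FastQC. This is only a sketch: params.fastq_dir and params.fastqc_dir are the parameters from the question, assumed to be set in nextflow.config or on the command line, and the output glob assumes FastQC's usual <name>_fastqc.html / <name>_fastqc.zip naming.

```nextflow
process FastQC {
  tag "FastQC on ${fastq.simpleName}"

  // Declared outputs are copied here after the task finishes
  publishDir params.fastqc_dir, mode: 'copy'

  input:
    path fastq

  output:
    // Names are relative to the task working directory
    path "*_fastqc.{html,zip}"

  script:
    // No -o with an absolute path: FastQC writes next to its input,
    // which Nextflow has staged in the task working directory
    """
    fastqc ${fastq}
    """
}

workflow {
  FastQC(Channel.fromPath(params.fastq_dir))
}
```

The same pattern would apply to Trimmomatic: write the trimmed file under a relative name, declare it under output:, and let publishDir place it in the trimmed directory.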

The following are rather minor points, perhaps not even real problems, just style:

I'm not sure whether .output actually works to refer to a process output. In my experience, and based on the docs, .out is the one to use.
Channel.fromPath is given a parameter, fastq_dir, whose name invites being read as the directory containing the FASTQ files rather than as a list of FASTQ files. Instead of expecting the user to supply the wildcard characters needed for expansion, as the default value does, I would add the wildcards in the code, like so:

Channel.fromPath("${params.fastq_dir}/**/*.fastq.gz")

Upvotes: 0

dthorbur

Reputation: 1091

publishDir works by emitting items in the process output declaration to the path provided. You haven't provided an output declaration for either process, so Nextflow doesn't think there is anything to publish.

Also, unless you're using it for checkpointing, you don't need the output from FastQC for Trimmomatic; you can get the two processes to run in parallel.

Don't use joinPath or any absolute path in your processes. That's not what Nextflow is designed for, and it will often lead to errors. Plus, by putting an absolute path in the output declaration, you're telling the process to look in the output directory for the file generated in the process. Use publishDir to emit files.

The file qualifier is deprecated. Use path instead. The Nextflow documentation is amazing. It's a steep learning curve, but the docs are very good at describing how things work.

So here is an updated script:

process FastQC {
  tag "Running FastQC on ${sampleid}"

  // publishDir takes the target path followed by named options
  publishDir "${params.fastqc_dir}/${fastq.baseName}", mode: 'move'

  input:
    tuple val(sampleid), path(fastq)

  output:
    path("*.html")

  script:
    """
    fastqc ${fastq}
    """
}

process Trimmomatic {
  tag "Trimming ${sampleid}"

  publishDir "${params.trimmed_dir}", mode: 'copy'

  input:
    tuple val(sampleid), path(fastq)

  output:
    path("*_trimmed.fastq.gz")

  script:
    // This mirrors the command in the question. Note that Trimmomatic's PE mode
    // expects two input files and four output files, followed by trimming steps.
    """
    java -jar ${params.trimmomatic_jar} PE -threads 4 \
      ${fastq} ${sampleid}_trimmed.fastq.gz \
      ${sampleid}_unpaired.fastq.gz \
      ${sampleid}_unpaired.fastq.gz
    """
}

In the workflow, you shouldn't need to tell the processes to iterate over each element. This is the default behaviour of the tool. I've added some operators to the channel creation to highlight some flexibility and checks you can add.

workflow {
  // Here params.fastq_dir is a directory, not a glob as in the question
  Channel
    .fromPath("${params.fastq_dir}/*{.fastq.gz,.fq.gz,.fastq,.fq}")
    .map { it -> tuple( it.simpleName, it ) }
    .ifEmpty { error "Cannot find any fastq files in ${params.fastq_dir}" }
    .set { fastq_files }

  FastQC(fastq_files)
  Trimmomatic(fastq_files)
}
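One caveat on the Trimmomatic command: it is called in PE mode but only receives one FASTQ file per task. If the reads really are paired-end, a sketch along the following lines may be closer to what is needed. Everything specific here is an assumption rather than something taken from the question: the process name TrimmomaticPE, the <sample>_1.fastq.gz / <sample>_2.fastq.gz file naming, params.fastq_dir being a directory, and the SLIDINGWINDOW/MINLEN values, which are illustrative and not tuned.

```nextflow
// Assumption: paired files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz
// and live in params.fastq_dir (treated as a directory here).
process TrimmomaticPE {
  tag "Trimming ${sampleid}"

  publishDir params.trimmed_dir, mode: 'copy'

  input:
    tuple val(sampleid), path(reads)

  output:
    tuple val(sampleid), path("${sampleid}_{1,2}_trimmed.fastq.gz")

  script:
    // PE argument order: in1 in2 out1_paired out1_unpaired out2_paired out2_unpaired,
    // followed by the trimming steps (illustrative values only)
    """
    java -jar ${params.trimmomatic_jar} PE -threads 4 \
      ${reads[0]} ${reads[1]} \
      ${sampleid}_1_trimmed.fastq.gz ${sampleid}_1_unpaired.fastq.gz \
      ${sampleid}_2_trimmed.fastq.gz ${sampleid}_2_unpaired.fastq.gz \
      SLIDINGWINDOW:4:20 MINLEN:36
    """
}

workflow {
  // fromFilePairs emits tuples of the form [ sampleid, [ read1, read2 ] ]
  read_pairs = Channel.fromFilePairs("${params.fastq_dir}/*_{1,2}.fastq.gz")
  TrimmomaticPE(read_pairs)
}
```

If the data are single-end instead, Trimmomatic's SE mode takes one input file and one output file, which matches the one-file-per-task channel used above.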

EDIT: Missed some of the absolute paths. Updated the input to be a tuple instead, since it's better at handling names this way, and adjusted the tags.

Upvotes: 1
