Qba Liu

Reputation: 73

Nextflow script to process all files in given directory

I have a nextflow script that runs a couple of processes on a single vcf file. The file is named 'bos_taurus.vcf' and is located at /input_files/bos_taurus.vcf. The directory input_files/ also contains another file, 'sacharomyces_cerevisea.vcf'. I would like my nextflow script to process both files. I tried using a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.

#!/usr/bin/env nextflow

nextflow.enable.dsl=2


// here I tried to use globbing

params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"

params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"


file_channel = Channel.fromPath( params.input_files, checkIfExists: true )


// how can I make this process work on two files simultaneously

process FILTERING {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    path(input_files)

    output:
    path("*")

    script:
    """
    vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode  --recode-INFO-all  --out after_filtering.vcf
    """
}

Upvotes: 2

Views: 1873

Answers (2)

ATpoint

Reputation: 878

Here is a little example for starters. First, you should give each sample a unique output name. Currently, after_filtering.vcf is hardcoded, so the outputs will overwrite each other once copied to the publishDir. You can do that with the file's baseName, as below, and store it permanently in the input channel as a tuple whose first element is the sample name and whose second is the actual file. Below is an example process that just runs head on the vcf; adapt it to what you actually need.

#! /usr/bin/env nextflow

nextflow.enable.dsl = 2

params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"

// A channel that contains a map with sample name and the file itself
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
                      .map { it -> [it.baseName, it] }

// An example process just head-ing the vcf
process VcfHead {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(name), path(vcf_in)

    output:
    path("*_head.vcf")

    script:
    """ 
    head -n 1 $vcf_in > ${name}_head.vcf
    """

}

// Run it
workflow {

    VcfHead(file_channel)

}

The file_channel channel looks like this if you add a .view() to it:

[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]
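
If you want to plug the original vcftools filtering into this layout, a minimal sketch (assuming vcftools is on your PATH) could look like the process below. It relies on vcftools writing <prefix>.recode.vcf when --recode and --out <prefix> are combined, so the output declaration captures exactly that file; adapt the flags to what you actually need.

// Sketch only: the original filtering step adapted to the [name, file] tuples
process FILTERING {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(name), path(vcf_in)

    output:
    path("${name}.recode.vcf")

    script:
    """
    vcftools --vcf ${vcf_in} --mac 1 --minQ 20 --recode --recode-INFO-all --out ${name}
    """
}

You would then call FILTERING(file_channel) in the workflow block, just like VcfHead above.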

Upvotes: 3

Steve

Reputation: 54502

Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:

params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"


process FILTERING {

    tag { sample }

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(sample), path(indexed_vcf)

    output:
    tuple val(sample), path("${sample}.filtered.vcf")

    """
    vcftools \\
        --vcf "${indexed_vcf.first()}" \\
        --mac 1 \\
        --minQ 20 \\
        --recode \\
        --recode-INFO-all \\
        --out "${sample}.filtered.vcf"
    """
}

workflow {

    vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )

    FILTERING( vcf_files ).view()
}
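
For reference, fromFilePairs emits one tuple per sample of the form [ sample, [ vcf, tbi ] ], which is why the process picks out the VCF itself with indexed_vcf.first(). A quick way to confirm the channel contents (the paths in the comment are hypothetical) is:

// Sanity check only: print what fromFilePairs actually groups together
Channel
    .fromFilePairs( params.vcf_files, checkIfExists: true )
    .view()
// e.g. [A, [/path/to/input_files/A.vcf.gz, /path/to/input_files/A.vcf.gz.tbi]]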

Results:

$ nextflow run main.nf
N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor >  local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]

Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
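
As a rough sketch only (the flag mapping below is an approximation, not a verified one-to-one translation of the vcftools options above, and the file names are placeholders), the same kind of site filtering with BCFtools could look like:

bcftools view \
    --min-ac 1:minor \
    -e 'QUAL<20' \
    -Oz \
    -o sample.filtered.vcf.gz \
    sample.vcf.gz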

Upvotes: 4
