zillur rahman

Reputation: 395

Merge multiple output chunks to one file in nextflow

I have a nextflow process that outputs multiple files, like below:

[chr1,/path/to/chr1_chunk1.TC.linear]
[chr1,/path/to/chr1_chunk1.HDL.linear]
[chr1,/path/to/chr1_chunk2.TC.linear]
[chr1,/path/to/chr1_chunk2.HDL.linear]
.....

I got the output above after applying the transpose() operator.

Now I want to concatenate all chunks from all chromosomes, ordered by chromosome and chunk number, so that I end up with one file for TC and another for HDL. I have multiple traits spread across many chunks, so the linked question, "output files (chromosomal chunks) merging in nextflow", doesn't help. Any suggestions?

Upvotes: 2

Views: 1584

Answers (2)

Steve

Reputation: 54502

If your chunk files are sufficiently small, you can use the collectFile operator to concatenate them into files whose names are defined by a dynamic grouping criterion:

The grouping criteria is specified by a closure that must return a pair in which the first element defines the file name for the group and the second element the actual value to be appended to that file.

To sort by chromosome number and then by chunk number, you can use the toSortedList and flatMap operators to feed the sorted collection into the collectFile operator:

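input_ch
    .map { key, chunk_file ->
        // Parse chromosome, chunk number, and trait from the file name
        def matcher = chunk_file.name =~ /^chr(\d+)_chunk(\d+)\.(\w+)\.linear$/
        def (_, chrom, chunk, trait) = matcher[0]

        tuple( (chrom as int), (chunk as int), trait, chunk_file )
    }
    // Sort numerically by chromosome, then by chunk
    .toSortedList( { a, b -> (a[0] <=> b[0]) ?: (a[1] <=> b[1]) } )
    // Emit the sorted tuples one at a time
    .flatMap()
    // sort: false appends entries in arrival order, preserving the sort above
    .collectFile( sort: false ) { chrom, chunk, trait, chunk_file ->
        [ "${trait}.linear", chunk_file.text ]
    }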
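
To sanity-check the ordering outside Nextflow, the same parse, sort, and group logic can be sketched in plain Python. This is only an illustration, not part of the pipeline: merge_order and the file names below are hypothetical, and the function returns the order in which chunks would be appended rather than writing any files.

```python
import re
from collections import defaultdict

# Same pattern as the Groovy regex above: chromosome, chunk, trait.
PATTERN = re.compile(r"^chr(\d+)_chunk(\d+)\.(\w+)\.linear$")


def merge_order(file_names):
    """Return {merged_file_name: [chunk names in append order]}."""
    parsed = []
    for name in file_names:
        match = PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        chrom, chunk, trait = match.groups()
        parsed.append((int(chrom), int(chunk), trait, name))

    # Sort numerically by chromosome, then chunk (so chr2 precedes chr10).
    parsed.sort(key=lambda item: (item[0], item[1]))

    # Group by trait while keeping the sorted order within each group.
    merged = defaultdict(list)
    for _chrom, _chunk, trait, name in parsed:
        merged[f"{trait}.linear"].append(name)
    return dict(merged)


# Hypothetical file names, for illustration only.
names = [
    "chr10_chunk1.TC.linear",
    "chr2_chunk2.HDL.linear",
    "chr2_chunk1.TC.linear",
    "chr2_chunk1.HDL.linear",
    "chr2_chunk2.TC.linear",
]
order = merge_order(names)
```

Casting the captured groups to integers is what keeps chr2 ahead of chr10; sorting the raw strings would place chr10 first.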

Upvotes: 2

mribeirodantas

Reputation: 489

You can use a combination of the branch and collectFile operators. Consider the following directory structure, where each .linear file contains its own name:

➜  sandbox tree .
.
├── ex1.HDL.linear
├── ex1.TC.linear
├── ex2.HDL.linear
├── ex2.TC.linear
├── ex3.HDL.linear
├── ex3.TC.linear
└── example.nf

I wrote the following minimal reproducible example:

workflow {
  files = Channel.fromPath('**.linear', checkIfExists: true)
  files
    .branch {
      TC: it.name.contains('TC')
      HDL: it.name.contains('HDL')
    }
    .set { result }
  result
    .TC
    .collectFile(name: 'TC.txt', storeDir: '/Users/mribeirodantas/sandbox')
  result
    .HDL
    .collectFile(name: 'HDL.txt', storeDir: '/Users/mribeirodantas/sandbox')
}

After running this pipeline with nextflow run example.nf, two new files appear in the /Users/mribeirodantas/sandbox folder: TC.txt and HDL.txt. The content of TC.txt, for example, is:

ex2.TC.linear
ex3.TC.linear
ex1.TC.linear

Upvotes: 4
