Is there a way to access/modify the contents of a Nextflow channel?

Question

My workflow produces a main directory, which I emit from a process using DSL2. I feed this output to a Python script, which loops over the subdirectories and their files, pulls out information, and compiles it into a .tsv file.

The two important pieces of information the Python script extracts are the name of each subdirectory and which file within that subdirectory actually matters.
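
For illustration, the kind of directory walk described above could look roughly like the sketch below. This is not the actual collect_org_names.py; the function name collect_rows is hypothetical, and the sketch assumes the important file in each subdirectory is the one whose name ends in _genomic.fna.

```python
from pathlib import Path


def collect_rows(data_dir):
    """Return one (subdirectory name, genome filename) pair per subdirectory."""
    rows = []
    for sub in sorted(Path(data_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip plain files sitting next to the subdirectories
        # Assumption: the file of interest ends in _genomic.fna.
        genomes = sorted(sub.glob("*_genomic.fna"))
        if genomes:
            rows.append((sub.name, genomes[0].name))
    return rows
```

Writing each pair as a tab-separated line would then give a file with the subdirectory and filename columns.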

I would like to take my process output ("root dir") + subdirectory (from the file) + important filename (from the file) and combine them into a new path in a channel that I can feed to another process.

Am I just using a bad method? Is there a better way to access the contents of a channel? In the documentation I saw subscribe, but I haven't had any luck using it. Thank you in advance.

Example .tsv file (columns 1 and 3 are what I want to append to the channel)

GCF_000005845.2 Escherichia coli str. K-12 substr. MG1655, complete genome      GCF_000005845.2_ASM584v2_genomic.fna
GCF_000008865.2 Escherichia coli O157:H7 str. Sakai DNA, complete genome        GCF_000008865.2_ASM886v2_genomic.fna
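
To make the goal concrete, here is a minimal Python sketch of the path I am trying to build from each row: root dir + column 1 + column 3. The function name genome_paths and the example root directory are hypothetical, and the sketch assumes the columns are separated by tabs.

```python
import csv
import io
from pathlib import Path


def genome_paths(data_dir, tsv_text):
    """Join data_dir with column 1 (subdirectory) and column 3 (filename)."""
    paths = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        paths.append(Path(data_dir) / row[0] / row[2])
    return paths


# Rows copied from the example .tsv, with tabs between the columns.
example = (
    "GCF_000005845.2\tEscherichia coli str. K-12 substr. MG1655, complete genome"
    "\tGCF_000005845.2_ASM584v2_genomic.fna\n"
    "GCF_000008865.2\tEscherichia coli O157:H7 str. Sakai DNA, complete genome"
    "\tGCF_000008865.2_ASM886v2_genomic.fna\n"
)
for path in genome_paths("ecoli/ncbi_dataset/data", example):
    print(path)
```

What I am after is the equivalent of this inside the workflow, with each resulting path emitted as an item in a channel.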

Work directory structure

├── c6
│   └── 6598d4838f61d0421f03216990465c
│       ├── ecoli
│       │   ├── README.md
│       │   └── ncbi_dataset
│       │       ├── data
│       │       │   ├── GCF_000005845.2
│       │       │   │   ├── GCF_000005845.2_ASM584v2_genomic.fna
│       │       │   │   ├── genomic.gff
│       │       │   │   ├── protein.faa
│       │       │   │   └── sequence_report.jsonl
│       │       │   ├── GCF_000008865.2
│       │       │   │   ├── GCF_000008865.2_ASM886v2_genomic.fna
│       │       │   │   ├── genomic.gff
│       │       │   │   ├── protein.faa
│       │       │   │   └── sequence_report.jsonl
│       │       │   ├── assembly_data_report.jsonl
│       │       │   └── dataset_catalog.json
│       │       └── fetch.txt

Here is my nextflow script (constructive criticism very welcome):

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

workflow {

  //ref_genome_ch = Channel.fromPath("$params.ref_genome")
  println([params.taxon, params.zipName, params.unzippedDir])
  DOWNLOAD_ZIP(params.taxon, params.zipName)
  UNZIP(DOWNLOAD_ZIP.out.zipFile)
  REHYDRATE(UNZIP.out.unzippedDir)
  COLLECT_NAMES(REHYDRATE.out.dataDir)


  // I want to get the dir name and file name out of
  // relations.txt
  //thing = Channel.from(  )
  //thing.view()
  //organism_genomes = REHYDRATE.out.dataDir.subscribe { println("$it/")}

}

process DOWNLOAD_ZIP {
  errorStrategy 'ignore'

  input:
  val taxonName
  val zipName

  output:
  path "${zipName}" , emit: zipFile

  script:
  def reference = params.reference
  """
  datasets download genome \
     taxon '${taxonName}' \
     --dehydrated \
     --filename ${zipName} \
     ${reference} \
     --exclude-genomic-cds
  """

}


process UNZIP {
  input:
  path zipFile

  output:
  path "${zipFile.baseName}" , emit: unzippedDir

  script:
  """
  unzip $zipFile -d ${zipFile.baseName}
  """

}


process REHYDRATE {
  input:
  path unzippedDir

  output:
  path "$unzippedDir/ncbi_dataset/data" , emit: dataDir

  script:
  """
  datasets rehydrate \
     --directory $unzippedDir
  """
}



process COLLECT_NAMES {
  publishDir params.results

  input:
  path dataDir

  output:
  path "relations.txt" , emit: org_names

  script:
  """
  python "$baseDir/bin/collect_org_names.py" $dataDir
  """

}

Edit: User @Steve recommended channel operators. I don't fully understand the Groovy {thing -> stuff} closure syntax yet, but I tried this:

thing = REHYDRATE.out.dataDir.map{"$it/*"}
thing.view()

and I get

/mnt/c/Users/mkozubov/Desktop/nextflow_tutorial/tRNA_stuff/work/d0/long_hash/ecoli/ncbi_dataset/data/*

printed. But when I feed this into a process that just has script: println(input), I get an error saying that the command executed is null, the command output is (empty), and that target '*' is not a directory.

My question is: why didn't the .map operator expand the *, the way it would have been expanded if I had created a channel from "PATH/*" directly?

Edit 2: I feel like I almost had something. I changed the output of the COLLECT_NAMES script to contain the path to the files. I now want to parse this file and read its contents into a channel. For that I did:

organism_genome_files = Channel.from()
  COLLECT_NAMES.out.org_names.map {
    new File(it.toString()).eachLine { line ->
      organism_genome_files << line.split('\t')[3] }
  }

If I replace organism_genome_files << line.split('\t')[3] with println line.split('\t')[3], I can see the content I want, but I can't find a way to pull this information out.

I also tried it with organism_genome_files as a list, but nothing seems to work. I just can't seem to pull information out of a channel and modify it.

The .splitCsv() operator seems like it could be useful, but I still don't understand how to get the contents of one channel to work as input to another channel :(

