Reputation: 690
I have a pair of FASTA files that I want to split into smaller chunks so that I can parallelize the processing.
The first file, reads.fasta, contains DNA sequences:
>/kingdoms/rce/workspace1/Nanopore/20180223-run9/RawData/BC-BD-chr10/i0013771_20180416_FAH66366_MN19358_sequencing_run_1042_63976_read_126980_ch_412_strand.fast5_template_deepnano {'mapped_end': 24599, 'num_matches': 22704, 'mapped_strand': '+', 'clipped_bases_end': 18, 'num_insertions': 715, 'mapped_start': 226, 'mapped_chrom': 'chr10', 'num_mismatches': 795, 'clipped_bases_start': 154, 'num_deletions': 874}
CXXACCCGGAGXXXCAGCXAAAAGCXAXACXXACXACCXXTAXXXTATGXXXACXXXXXAXAGACXGTCXXXXCAXCCXACXCCTXCGCACTTGXCXCXCGCXACXGCCGXGCAACAAACACXAAAXCAAAACAGXAAAAXACXACAXCAAAACGCATAXXCCCXAGAAAAAAAXXXTCXXACAATAXACXAXACXACACAAXACABAAXCAGXGACXXXCGXAACAACAAXXXCCTXCACXCXCCAACTXCXCXGCXCGAAXCCCXACATAAXAATATAXCAAAXCXACCGXCXGGAACAXCAXCGCXAXCCAGCXCXTTGXGAACCGCXACCAXCAGCABGXACAGXGGXACCCXCGTGXCAXCXGCAGCGAGAACTXCAACGXXXGCCAAAXCAAGCCAATGXGGXAACAACCACACC
>/kingdoms/rce/workspace1/Nanopore/20180223-run9/RawData/BC-BD-chr10/i0013771_20180416_FAH66366_MN19358_sequencing_run_1042_63976_read_55042_ch_362_strand.fast5_template_deepnano {'mapped_end': 202484, 'num_matches': 12382, 'mapped_strand': '-', 'clipped_bases_end': 33, 'num_insertions': 442, 'mapped_start': 189194, 'mapped_chrom': 'chr10', 'num_mismatches': 461, 'clipped_bases_start': 20, 'num_deletions': 447}
XGAXXXTAATGXTAAAXCGAXAGXACCAAGXCXXTTGTTGTAXACXAGAXCCAXXCCXAATATAXCTGTAXCGAGXACAXCGXCTAXXAATGXXCCTGXAAXXXXCAGXXCAAAAXXACXXXXCAAXTBGXXTAXGAAXXCAXCCAAXCXCTGXXCAXXGCXXGCCGCAAXXACGCAGXCAXCAACAXAGACXGCAAXCAXXAGAXXXXBAXCCXCGGXXXGGTAXAAXCCCGGAGTAXAAGAGXXATCXXXCAGXCCAAXXCCAXXCAAGTATTGTCXXAGAXGAXCAXXCCAXTCXXXAGGACXCTGXXXXAGACCATAXAACGCCXTAXXXAGCXXGACXACACAXCXCCXAXCAXGCGGATGXGGGATGTATAXXBCTTCTXCCAAXXXAGCATAXAGGAAXGCAXGAXXGA
...
The second file, reads.fasta_values, contains for each read a sequence of space-separated values that corresponds to the DNA sequence of the same read in reads.fasta (the records are in the same order):
>/kingdoms/rce/workspace1/Nanopore/20180223-run9/RawData/BC-BD-chr10/i0013771_20180416_FAH66366_MN19358_sequencing_run_1042_63976_read_126980_ch_412_strand.fast5_template_deepnano
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.03 0.03 0.03 0.03
>/kingdoms/rce/workspace1/Nanopore/20180223-run9/RawData/BC-BD-chr10/i0013771_20180416_FAH66366_MN19358_sequencing_run_1042_63976_read_55042_ch_362_strand.fast5_template_deepnano
0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
...
I want to get several pairs of smaller files.
So far, I have tried pairing the files and then splitting them, but that only splits the first file of the pair:
Channel
    .fromFilePairs("reads{.fasta,.fasta_values}", flat: true)
    .splitFasta(by: 1, file: true)
    .println()
Output:
[reads, reads.1.fasta, reads.fasta_values]
[reads, reads.2.fasta, reads.fasta_values]
[reads, reads.3.fasta, reads.fasta_values]
whereas I want something like this:
[reads, reads.1.fasta, reads.1.fasta_values]
[reads, reads.2.fasta, reads.2.fasta_values]
[reads, reads.3.fasta, reads.3.fasta_values]
I think something similar is doable with fastq files for paired-end reads, but I could not find out how to do it with fasta files.
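For reference, the paired-end fastq case mentioned above can be sketched roughly as follows. This is only an illustration: the glob pattern and the chunk size are made-up placeholders, and it relies on the pe option of splitFastq as I understand it from the Nextflow documentation.

```groovy
// Sketch only: the file pattern and chunk size are hypothetical.
Channel
    .fromFilePairs("sample_{1,2}.fastq", flat: true)
    // pe: true asks splitFastq to split both mates of each pair in lockstep
    .splitFastq(by: 100_000, pe: true, file: true)
    .view()
```

What I am after is the same lockstep behaviour for the two fasta files.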
Any help is appreciated,
Thanks.
Upvotes: 0
Views: 523
Reputation: 690
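OK, I found it: I just needed the elem argument of splitFasta. Passing elem: [1, 2] tells the operator to split the tuple elements at positions 1 and 2 (the two files) rather than only the first file it finds: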
Channel
    .fromFilePairs("reads{.fasta,.fasta_values}", flat: true)
    .splitFasta(by: 1, file: true, elem: [1, 2])
    .println()
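To actually parallelize the work, the split pairs can then be fed to a process, one task per chunk pair. Below is a minimal sketch, assuming DSL2 syntax (recent Nextflow versions); the process name countRecords and its script are hypothetical placeholders for the real per-chunk processing.

```groovy
// Hypothetical process: reports how many records each chunk of the pair holds.
process countRecords {
    input:
    tuple val(id), path(fasta), path(values)

    output:
    stdout

    script:
    """
    echo "${id}: \$(grep -c '^>' ${fasta}) sequences, \$(grep -c '^>' ${values}) value records"
    """
}

workflow {
    // Each emitted item is a tuple: [id, fasta chunk, matching values chunk]
    chunks = Channel
        .fromFilePairs("reads{.fasta,.fasta_values}", flat: true)
        .splitFasta(by: 1, file: true, elem: [1, 2])

    countRecords(chunks).view()
}
```

With by: 1 every task should see exactly one sequence and one value record, so a mismatch in the printed counts would suggest that the two files are not in the same order.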
Upvotes: 1