Record separator within a record separator

Question

How can I make use of a record separator, and then simultaneously use a sub-record separator? Perhaps that isn't the best way to think about what I am trying to do. Here is my goal:

I want to perform a while loop on a single tab delimitated item at a time, in a specified row of items. For every line (row) of tab separated items, I need to print the outcomes of all the while loops into a unique file. Allow the following examples to help clarify.

My input file will be something like the following. It will be called "Clustered_Barcodes.txt"

    TTTATGC TTTATGG TTTATCC TTTATCG
    TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
    CTTGTAA

My perl code looks like the following:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;

    my %hash = (
            "TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
            "TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
            "TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
            "TTTATCG" => "TAGCTAGCTTTATCGCGTACGTA",
            "TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
            "TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
            "TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
            "TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
            "TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
            "CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
    );

    while() {
            $/ = "
";
            my @lines = ;
            open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
            foreach my $sequence (@lines){
                   if (exists $hash{$sequence}){
                   print $out ">$sequence
$hash{$sequence}
";
                   }
            }
   }

My desired output would be three different files. The first file will be called "Clustered_Barcode_1.fasta" and will look like:

    >TTTATGC
    TATAGCGCTTTATGCTAGCTAGC 
    >TTTATGG 
    TAGCTAGCTTTATGGGCTAGCTA 
    >TTTATCC
    GCTAGCTATTTATCCGCTAGCTA
    >TTTATCG
    TAGCTAGCTTTATCGCGTACGTA

Note that this is formatted so that the keys are preceded by a carrot, and then on the next line is the longer associated sequence (value). This file includes all the sequences in the first line of Clustered_Barcodes.txt

My third file should be named "Clustered_Barcode_3.fasta" and look like the following:

    >CTTGTAA 
    ATCGATCGCTTGTAACGATTAGC

When I run my code, it only takes the second and third lines of sequences in the input file. How can I start with the first line (by getting rid of the requirement for a record separator)? How can I then process each item at a time and then print the line's worth of results into one file? Also, if there is a way to incorporate the number of sequences into the file name, that would be great. It would help me to later organize the files by size. For example, the name could be something like "Clusterd_Barcodes_1_File_3_Sequences.fasta".

Thank you all.

ysth · Accepted Answer

There's no need to read in the whole file that I see here. You just need to loop over the contents of each line:

    while(my $line = ) {
        chomp $line;
        open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
        foreach my $sequence ( split /	/, $line ){
            if (exists $hash{$sequence}){
                print $out ">$sequence
$hash{$sequence}
";
            }
        }
    }

Record separator within a record separator

Answers (2)

Related Questions