Rob

Reputation: 175

Record separator within a record separator

How can I make use of a record separator, and then simultaneously use a sub-record separator? Perhaps that isn't the best way to think about what I am trying to do. Here is my goal:

I want to loop over the tab-delimited items in a given row, one item at a time. For every line (row) of tab-separated items, I need to print the results for all of that row's items into its own file. The following examples should help clarify.

My input file, called "Clustered_Barcodes.txt", will look something like the following:

    TTTATGC TTTATGG TTTATCC TTTATCG
    TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
    CTTGTAA 

My Perl code looks like the following:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;

    my %hash = (
            "TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
            "TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
            "TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
            "TTTATCG" => "TAGCTAGCTTTATCGCGTACGTA",
            "TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
            "TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
            "TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
            "TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
            "TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
            "CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
    );

    while(<INFILE>) {
            $/ = "\n";
            my @lines = <INFILE>;
            open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
            foreach my $sequence (@lines){
                   if (exists $hash{$sequence}){
                   print $out ">$sequence\n$hash{$sequence}\n";
                   }
            }
    }

My desired output would be three different files. The first file will be called "Clustered_Barcode_1.fasta" and will look like:

    >TTTATGC
    TATAGCGCTTTATGCTAGCTAGC 
    >TTTATGG 
    TAGCTAGCTTTATGGGCTAGCTA 
    >TTTATCC
    GCTAGCTATTTATCCGCTAGCTA
    >TTTATCG
    TAGCTAGCTTTATCGCGTACGTA 

Note that this is formatted so that each key is preceded by a greater-than sign (>), with the longer associated sequence (the value) on the next line. This file includes all the sequences in the first line of Clustered_Barcodes.txt.

My third file should be named "Clustered_Barcode_3.fasta" and look like the following:

    >CTTGTAA 
    ATCGATCGCTTGTAACGATTAGC 

When I run my code, it only picks up the second and third lines of sequences in the input file. How can I start with the first line (by getting rid of the \n requirement for a record separator)? How can I then process one item at a time and print each line's worth of results into one file? Also, if there is a way to incorporate the number of sequences into the file name, that would be great; it would help me organize the files by size later. For example, the name could be something like "Clustered_Barcodes_1_File_3_Sequences.fasta".

Thank you all.

Upvotes: 2

Views: 141

Answers (2)

ysth

Reputation: 98398

I don't see any need to read in the whole file here. You just need to read one line at a time and loop over the contents of each line:

    while(my $line = <INFILE>) {
        chomp $line;
        open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
        foreach my $sequence ( split /\t/, $line ){
            if (exists $hash{$sequence}){
                print $out ">$sequence\n$hash{$sequence}\n";
            }
        }
    }
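
As for why the original loop skipped the first line, here is a minimal sketch, assuming a made-up three-line input held in memory rather than Clustered_Barcodes.txt. The helper name show_consumed_lines is hypothetical; it only mimics the shape of the original loop, where the while condition reads one line and the list-context read inside the body takes everything that is left.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the shape of the original loop on an
# in-memory string instead of a real file.
sub show_consumed_lines {
    my ($data) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

    my (@taken_by_while, @taken_by_slurp);
    while (my $line = <$fh>) {
        push @taken_by_while, $line;   # the loop condition reads one line
        push @taken_by_slurp, <$fh>;   # list context reads all remaining lines
    }
    return (\@taken_by_while, \@taken_by_slurp);
}

my ($by_while, $by_slurp) = show_consumed_lines("row1\nrow2\nrow3\n");
printf "while condition took %d line(s), the slurp took %d\n",
    scalar @$by_while, scalar @$by_slurp;
```

With that input the loop body runs only once: the first line is held by the loop condition and the other two end up in the array, which matches the symptom in the question.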

Upvotes: 2

melpomene

Reputation: 85837

OK, so here's one way to do it:

    #!/usr/bin/perl
    use strict;
    use warnings;

Standard preamble.

    my %hash = (
        "TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
        "TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
        "TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
        "TTTATCG" => "TAGCTAGCTTTATCGCGTACGTA",
        "TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
        "TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
        "TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
        "TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
        "TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
        "CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
    );

Set up the hash of sequences.

    my $infile = 'Clustered_Barcodes.txt';
    open my $infh, '<', $infile or die "$0: $infile: $!\n";

Open file for reading.

    chomp(my @rows = readline $infh);
    my $row_count = @rows;

Slurp all lines into memory in order to get the number of rows ($row_count). If the file is very large, this approach may not work, because you could run out of memory; how large is too large depends on how much RAM you have.

    my $i = 1;
    for my $row (@rows) {

Loop over the lines.

        my @fields = split /\t/, $row;

Split each line into fields separated by tabs.

        my $outfile = "Clustered_Barcodes_${i}_File_${row_count}_Sequences.fasta";
        $i++;
        open my $outfh, '>', $outfile or die "$0: $outfile: $!\n";

Open current output file and increment counter.

        for my $field (@fields) {
            print $outfh ">$field\n$hash{$field}\n" if exists $hash{$field};
        }

Write each field (and its mapping) to outfile.

    }

And we're done. The main difference from your original code is the use of split /\t/ and foreach to loop over the fields within a line.
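
Note that $row_count is the number of rows in the whole input file, so it is the same in every file name. If what you want in the name is the number of sequences on each row instead, something like the following might work. This is only a sketch: outfile_name_for_row is a hypothetical helper, and it counts every tab-separated field on the row, whether or not that field has an entry in %hash.

```perl
use strict;
use warnings;

# Hypothetical helper: build the output file name from the row number and
# the number of tab-separated fields on that row.
sub outfile_name_for_row {
    my ($row_number, $row) = @_;
    my @fields = split /\t/, $row;
    my $field_count = @fields;   # array in scalar context gives its length
    return "Clustered_Barcodes_${row_number}_File_${field_count}_Sequences.fasta";
}

# First row of the example input, with explicit tabs.
print outfile_name_for_row(1, "TTTATGC\tTTTATGG\tTTTATCC\tTTTATCG"), "\n";
# prints Clustered_Barcodes_1_File_4_Sequences.fasta
```

Because the count comes from the row itself, this version does not need the whole file in memory.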


We can do it without slurping, too:

    while (my $row = readline $infh) {
        chomp $row;

Loop over the lines, one by one. This replaces the 4 lines from chomp(my @rows = readline $infh); to for my $row (@rows) {.

But now we've lost the $i and $row_count variables, so we have to change the initialization of $outfile:

        my $outfile = "Clustered_Barcodes_$..fasta";

That should be all the changes you need. (You can get $row_count back in this scenario by reading $infh twice (the first time just for counting, then seeking back to the start); this is left as an exercise for the reader.)
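
For what it's worth, the two-pass idea could look roughly like this. It is a sketch, not the only way to do it: count_rows_and_rewind is a hypothetical helper, it assumes the handle is seekable (a regular file or an in-memory handle, not a pipe), and the demo input is made up. Note that seek does not reset $., so the line counter is reset by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines on an open, seekable handle, then
# rewind it so the caller can read from the start again.
sub count_rows_and_rewind {
    my ($fh) = @_;
    my $count = 0;
    while (defined(my $line = <$fh>)) {
        $count++;
    }
    seek $fh, 0, 0 or die "seek failed: $!";
    $. = 0;   # seek leaves the line counter alone, so reset it manually
    return $count;
}

# Demo on an in-memory handle standing in for Clustered_Barcodes.txt.
my $demo_data = "A\tB\nC\nD\tE\tF\n";
open my $demo_fh, '<', \$demo_data or die "Cannot open in-memory handle: $!";
my $demo_count = count_rows_and_rewind($demo_fh);
print "rows: $demo_count\n";   # prints rows: 3
```

After the call, the while loop above can run unchanged, and $. starts counting from 1 again.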

Upvotes: 3
