How to save matched and nonmatched FASTA sequences

Question

I have a file with FASTA sequences like this

 >seq002
 ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
 >seq0009
 ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
 >seq0001
 ATGGGAAATAGTGAGGACGGGAAATCTTTAG
 >seq0003
 ATGGGATCTTACTTGAACTTCAAGAATTGA
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG

I am trying to extract sequence start with ATG and end with TAG/TGA/TAA. I am using this code for my purpose

#!/usr/bin/perl -w
# This script reads several sequences and print the sequence which don't strat with ATH and ends with TAG/TGA/TAA

use strict; 

my $infile = "id.fasta";# This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open

my $outfile = "full_length_seq.txt";# This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $line;                             # This reads the input file one-line-at-a-time

while ($line = ) {
    chomp $line;# This removes "
" at the end of each line (this is invisible)

    if($line =~m/^ATG[GTAC]+T(GA|AA|AG)$/g) { # This finds lines matching with pattern
        next;

      }

    print OUTFILE $line, "
";
}

Which produce the result like this

>seq002
>seq0009
>seq0001
>seq0003
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG

But I want to create two different files like this

>seq002
 ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
 >seq0009
 ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
 >seq0001
 ATGGGAAATAGTGAGGACGGGAAATCTTTAG
 >seq0003
 ATGGGATCTTACTTGAACTTCAAGAATTGA

and

>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG

any hint or help will be really appreciated. Thanks.

Sobrique · Accepted Answer

So the trick is - you're currently splitting on linefeed and working line by line.

But you don't have to - you can use $/ instead, and set it to a suitable delimiter.

I would suggest for this, you want " >", because then it'll take your things in chunks.

And then, you need to alter your pattern match slightly, because now each 'chunk' is two lines.

Something like this:

#!/usr/bin/env perl

use strict;
use warnings;

local $/ = "
>";

open my $outfile, '>', 'full_length_seq.txt' or die $!;
open my $other_outfile, '>', 'everything_else.txt' or die $!;


while (  ) { 
    chomp;
    s/^>//g; #remove leading >, because first line doesn't have a linefeed in front. 

    #just for some diagnostics - print what we're currently operating on. 
    print "New chunk:
";
    print;
    print "
End
";


    if ( /
ATG[GTAC]+T(GA|AA|AG)$/ ) {
        print "**matches**
";
        print {$other_outfile} ">",$_,"
";
    }
    else {
        print {$outfile} ">", $_, "
";
    }

}

__DATA__
>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG

And this gives us one file with:

>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG

And the other with:

>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA

I used __DATA__ above as illustration - you should probably either still read an input file, or just use <> to read "STDIN or file named on command line" (like grep/sed etc.)

Additionally I would suggest using 3 argument open with lexical file handles as a better style. E.g.

open ( my $infile, '<', 'id.fasta' ) or die $!;

How to save matched and nonmatched FASTA sequences

Answers (1)

Related Questions