Reputation: 305
I have a file with FASTA sequences like this
>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG
I am trying to extract sequence start with ATG and end with TAG/TGA/TAA. I am using this code for my purpose
#!/usr/bin/perl -w
# This script reads several sequences and print the sequence which don't strat with ATH and ends with TAG/TGA/TAA
use strict;
my $infile = "id.fasta";# This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "full_length_seq.txt";# This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line;# This removes "\n" at the end of each line (this is invisible)
if($line =~m/^ATG[GTAC]+T(GA|AA|AG)$/g) { # This finds lines matching with pattern
next;
}
print OUTFILE $line, "\n";
}
Which produce the result like this
>seq002
>seq0009
>seq0001
>seq0003
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG
But I want to create two different files like this
>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA
and
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG
any hint or help will be really appreciated. Thanks.
Upvotes: 1
Views: 63
Reputation: 53478
So the trick is - you're currently splitting on linefeed and working line by line.
But you don't have to - you can use $/
instead, and set it to a suitable delimiter.
I would suggest for this, you want "\n>"
, because then it'll take your things in chunks.
And then, you need to alter your pattern match slightly, because now each 'chunk' is two lines.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
open my $outfile, '>', 'full_length_seq.txt' or die $!;
open my $other_outfile, '>', 'everything_else.txt' or die $!;
while ( <DATA> ) {
chomp;
s/^>//g; #remove leading >, because first line doesn't have a linefeed in front.
#just for some diagnostics - print what we're currently operating on.
print "New chunk:\n";
print;
print "\nEnd\n";
if ( /\nATG[GTAC]+T(GA|AA|AG)$/ ) {
print "**matches**\n";
print {$other_outfile} ">",$_,"\n";
}
else {
print {$outfile} ">", $_, "\n";
}
}
__DATA__
>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG
And this gives us one file with:
>seq002
ATGGTAAATGGTTTCTCAAATTGTGCACTGACAGACAAACCCCT
>seq00005
GCTAATTTTGAGGTTTACCCAGATAGCTG
And the other with:
>seq0009
ATGGCGTCAAAGGTGATGCCGTCAGCGTCAACAACTAA
>seq0001
ATGGGAAATAGTGAGGACGGGAAATCTTTAG
>seq0003
ATGGGATCTTACTTGAACTTCAAGAATTGA
I used __DATA__
above as illustration - you should probably either still read an input file, or just use <>
to read "STDIN or file named on command line" (like grep
/sed
etc.)
Additionally I would suggest using 3 argument open with lexical file handles as a better style. E.g.
open ( my $infile, '<', 'id.fasta' ) or die $!;
Upvotes: 1