jasongallant
jasongallant

Reputation: 91

Extracting DNA sequences from FASTA file with BioPerl with non-standard header

I'm trying to extract sequences from a database using the following code:

use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";

Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:

>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA

The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?

Upvotes: 2

Views: 2564

Answers (2)

user19758
user19758

Reputation: 151

I have related question to this post. I was wondering if anyone has tried what happens when the position in the query is beyond the outside the limit of the fasta position. So lets say, the fasta contains 100 bases and you query contains position 102, does this method trap the error. I tried this in some real data and it appears to always return "1", however, my fasta sequences contains 0/1 and so it is hard to understand if this is an error code/ it is returning the output for the wrong base.

I tried looking in the documentation but could not find anything.

Upvotes: 0

Borodin
Borodin

Reputation: 126722

By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.

I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.

This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.

use strict;
use warnings;

use Bio::DB::Fasta;

my ($file, $id, $start, $end) = qw(
  secondround_merged_expanded.fasta
  C7136661
  1 10
);

my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);

sub makeid {
  my ($head) = @_;
  $head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
  $1;
}

my $seq = $db->seq($id, $start, $end);
print $seq, "\n";

Upvotes: 3

Related Questions