Extracting DNA sequences from FASTA file with BioPerl with non-standard header

Question

I'm trying to extract sequences from a database using the following code:

use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"
";

Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:

>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA

The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?

Borodin · Accepted Answer

By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.

I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.

This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.

use strict;
use warnings;

use Bio::DB::Fasta;

my ($file, $id, $start, $end) = qw(
  secondround_merged_expanded.fasta
  C7136661
  1 10
);

my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);

sub makeid {
  my ($head) = @_;
  $head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
  $1;
}

my $seq = $db->seq($id, $start, $end);
print $seq, "
";

Extracting DNA sequences from FASTA file with BioPerl with non-standard header

Answers (2)

Related Questions