Reputation: 147
I wrote a script below to analyze two file formats from bedtools (-tab option, -name option) so it could combine the headers if the sequences match. The problem I ran into was that if the sequences matched to multiple names it only printed one of the names that correspond to it. I was wondering if anyone had a suggestion to how to approach this. As I want both the position of the sequence and name. Is there an option through bedtools?
My script stores both files into their own hashes and then loops through then if they are equal it is suppose to print out matches in sequences with the appropriate names. It does this but it does not give errors if multiple sequences correspond to the name, it simply does not print them. So my conclusion is that the foreach loop is failing syntax wise in some form I am not noticing. Any suggestions? Cheers.
sample data: -name output bedtools
>sequence_a
AGGT
>sequence_b
AAAA
>sequence_c
CCCC
>sequence_d
AAAA
sample data: -tab output bedtools
>1-5
AAAA
>10-14
ACCT
>15-19
CCCC
expected output from script
>sequence_b|1-5
AAAA
>sequence_c|15-19
CCCC
>sequence_d|1-5
AAAA
Script
my %sequence;
open(NAMES_FILE, $ARGV[0]) or die "Cannot open the file: $!";
my $hash_key_name;
my $hash_value_name;
while (my $line = <NAMES_FILE>) {
if ($line =~ /^>(\S+)/) {
$hash_key_name = $1;
}
elsif ($line =~ /\S/) {
chomp $line;
$hash_value_name = $line;
$sequence{$hash_key_name} = $hash_value_name;
}
}
my %sequence_2;
open (POSITIONS_FILE, $ARGV[1]) or die "Cannot open the file: $!";
my $hash_key_pos;
my $hash_value_pos;
while (my $line2 = <POSITIONS_FILE>) {
if ($line2 =~ /^>(\S+)/) {
$hash_key_pos = $1;
}
elsif ($line2 =~ /\S/) {
chomp $line2;
$hash_value_pos = $line2;
$sequence_2{$hash_key_pos} = $hash_value_pos;
}
}
foreach $hash_key_pos (keys %sequence_2) {
foreach $hash_key_name (keys %sequence) {
if ($sequence{$hash_key_name} eq $sequence_2{$hash_key_pos}){
print ">$hash_key_name|$hash_key_pos\n$sequence{$hash_key_name}\n"}
}
}
Upvotes: 1
Views: 102
Reputation: 4829
Hashes will happily overwrite values, saving only the latest value, without throwing errors. If you want to catch that, you will need to put an explicit check in to see if the hash has a value before you overwrite it, something like:
while (my $line = <NAMES_FILE>) {
if ($line =~ /^>(\S+)/) {
$hash_key_name = $1;
}
elsif ($line =~ /\S/) {
chomp $line;
$hash_value_name = $line;
if (defined($sequence{$hash_key_name}) && $sequence{$hash_key_name} ne $hash_value_name) {
die("multiple sequences match $hash_key_name: $sequence{$hash_key_name}, $hash_value_name");
}
$sequence{$hash_key_name} = $hash_value_name;
}
}
That being said, it would be most helpful if you could provide sample data which produced the error you are trying to catch. It looks as if the data above should not contain this error.
Upvotes: 1