Reputation: 480
I am using a regex but am getting some odd, unexpected "matches". "Names" are sent to a subroutine to be compared to an array called @ASlist, which contains multiple rows. The first element of each row is also a name, followed by zero or more synonyms. The goal is to match the incoming "name" to any row in @ASlist that has a matching cell.
Sample input, from which $name is derived for the comparison against @ASlist:
13 1 13 chr7 7 70606019 74345818 Otud7a Klf13 E030018B13Rik Trpm1 Mir211 Mtmr10 Fan1 Mphosph10 Mcee Apba2 Fam189a1 Ndnl2 Tjp1 Tarsl2 Tm2d3 1810008I18Rik Pcsk6 Snrpa1 H47 Chsy1 Lrrk1 Aldh1a3 Asb7 Lins Lass3 Adamts17
Sample lines from @ASlist:
HSPA5 BIP FLJ26106 GRP78 MIF2
NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13
ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK
The code:
my ($name) = @_;    ## this comes in from another loop elsewhere in code I did not include
chomp $name;
my @collectmatches = ();    ## container to collect matches
foreach my $ASline ( @ASlist ){
    my @synonyms = split("\t", $ASline );
    for ( my $i = 0; $i < scalar @synonyms; $i++ ){
        chomp $synonyms[ $i ];
        #print "COMPARE $name TO $synonyms[ $i ]\n";
        if ( $name =~m/$synonyms[$i]/ ){
            print "\tname $name from block matches\n\t$synonyms[0]\n\tvia $synonyms[$i] from AS list\n";
            push ( @collectmatches, $synonyms[0], $synonyms[$i] );
        }
        else {
            # print "$name does not match $synonyms[$i]\n";
        }
    }
}
The script works, but it also reports weird matches. For example, when $name is "E030018B13Rik" it matches "NDUFA5" when that occurs in @ASlist. These two should not be matched up.
If I change the regex from ~m/$synonyms[$i]/ to ~m/^$synonyms[$i]$/, the "weird" matches go away, BUT the script misses the vast majority of matches.
Upvotes: 0
Views: 178
Reputation: 126742
Another, more Perlish way to test for string equality is to use a hash.
You don't show any real test data, but this short Perl program builds a hash from your array @ASlist of lines of match strings. After that, most of the work is done.
The subsequent for loop tests just E030018B13Rik to see if it is one of the keys of the new %ASlist and prints an appropriate message.
use strict;
use warnings;

my @ASlist = (
    'HSPA5 BIP FLJ26106 GRP78 MIF2',
    'NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13',
    'ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK',
);

# Every whitespace-separated token on every line becomes a hash key.
my %ASlist = map { $_ => 1 } map /\S+/g, @ASlist;

for (qw/ E030018B13Rik /) {
    printf "%s %s\n", $_, $ASlist{$_} ? 'matches' : 'doesn\'t match';
}

Output:
E030018B13Rik doesn't match
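The question also wants to know which row matched, so the hash can store the first name of each row instead of 1. The following is only a sketch under assumptions: build_primary_lookup is a hypothetical helper name, fields are assumed to be whitespace-separated, and the //= operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Same sample rows as above; fields are assumed to be whitespace-separated.
my @ASlist = (
    'HSPA5 BIP FLJ26106 GRP78 MIF2',
    'NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13',
    'ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK',
);

# Hypothetical helper: map every name and synonym to the first name on its row.
sub build_primary_lookup {
    my @rows = @_;
    my %primary_for;
    for my $row (@rows) {
        my ( $primary, @synonyms ) = split ' ', $row;
        next unless defined $primary;    # skip empty rows
        # //= keeps the first row seen if a synonym appears on several rows.
        $primary_for{$_} //= $primary for $primary, @synonyms;
    }
    return \%primary_for;
}

my $primary_for = build_primary_lookup(@ASlist);

for my $name (qw/ E030018B13Rik B13 GRP78 /) {
    if ( exists $primary_for->{$name} ) {
        print "$name matches $primary_for->{$name}\n";
    }
    else {
        print "$name doesn't match\n";
    }
}
```

With the sample rows, B13 resolves to NDUFA5 and GRP78 to HSPA5, while E030018B13Rik is not found because hash lookup compares whole strings.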
Upvotes: 1
Reputation: 241988
You are using B13 as the regular expression. As none of the characters has a special meaning, any string containing the substring B13 matches the expression.
E030018B13Rik
       ^^^
If you want the expression to match the whole string, use anchors:
if ($name =~m/^$synonyms[$i]$/) {
Or, use index or eq to detect substrings (or identical strings, respectively), as your input doesn't seem to use any features of regular expressions.
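As a small illustration of the difference, using the two strings from the question (the variable names are just for this sketch):

```perl
use strict;
use warnings;

my $name    = 'E030018B13Rik';
my $synonym = 'B13';

# index returns the 0-based position of the substring, or -1 if it is absent.
my $pos = index( $name, $synonym );
print "substring found at offset $pos\n" if $pos >= 0;

# eq is true only when the two strings are identical.
print "identical\n"     if $name eq $synonym;
print "not identical\n" if $name ne $synonym;
```

Here index finds B13 inside the longer name, which is the same behaviour as the unanchored regex, while eq rejects the pair.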
Upvotes: 0
Reputation: 35208
The NDUFA5 record contains B13 as a pattern, which will match E030018<B13>Rik.
If you want to be more literal, then add boundary conditions to your regular expression: /\b...\b/. You should probably also escape regular expression special characters using quotemeta (\Q...\E).
if ( $name =~ m/\b\Q$synonyms[$i]\E\b/ ) {
Or, if you want to test straight equality, then just use eq:
if ( $name eq $synonyms[$i] ) {
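To see what the escaping and the boundaries each contribute, here is a rough comparison. The synonym H2A.1 is a made-up example containing a dot (it is not from the question's data), and matches_literal_word is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: literal, word-bounded match of $synonym inside $name.
sub matches_literal_word {
    my ( $name, $synonym ) = @_;
    return $name =~ /\b\Q$synonym\E\b/ ? 1 : 0;
}

# Made-up synonym containing a regex metacharacter (the dot).
my $synonym = 'H2A.1';

for my $name ( 'H2A.1', 'H2AX1', 'xH2A.1y' ) {
    my $raw     = $name =~ /$synonym/ ? 1 : 0;    # dot matches any character
    my $escaped = matches_literal_word( $name, $synonym );
    my $equal   = $name eq $synonym ? 1 : 0;
    print "$name: raw=$raw escaped=$escaped eq=$equal\n";
}
```

The unescaped pattern accepts all three names, whereas the escaped, bounded pattern and eq accept only the exact one. Note that \b treats a hyphen as a boundary, so a bounded B13 would still match inside something like X-B13-Y; eq remains the strictest test.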
Upvotes: 1
Reputation: 89584
Since you only need to compare two strings, you can simply use eq:
if ( $name eq $synonyms[$i] ){
Upvotes: 0