Reputation: 2086
I have two strings, and I would like to test if they are anagrams of each other.
To test if string A is an anagram of string B, the characters of A and B are both sorted. If the resulting sorted strings match exactly, string A and string B are anagrams of each other.
I am split
ing the strings up into character arrays, using Perl's sort
routine, join
ing the characters back together, and testing for string equality with eq
:
sub anagram
{
my ($s1, $s2) = @_;
return (join '', sort { $a cmp $b } split(//, $s1)) eq
(join '', sort { $a cmp $b } split(//, $s2));
}
Is there a way to avoid having to convert between the scalar and array types (relying on join
and split
)? And if so, which method is more efficient?
Upvotes: 14
Views: 6334
Reputation: 51137
Well, I've found a way that's over 30 times faster—though, arguably, its cheating. I've included the Benchmark.pm code to benchmark it, since you're apparently not familiar with it.
The benchmark is:
Rate Join Cheat
Join 83294/s -- -97%
Cheat 2580687/s 2998% --
And the code. After the third line, I think you'll understand why its arguably cheating:
use v5.14;
use Benchmark qw(cmpthese);
use Inline 'C';
sub an_join {
my ($s1, $s2) = @_;
return (join '', sort split(//, $s1)) eq
(join '', sort split(//, $s2));
}
use constant {
STR1 => 'abcdefghijklm',
STR2 => 'abcdefghijkmm',
STR3 => 'abcdefghijkml',
};
cmpthese(
0,
{
'Join' => 'an_join(STR1, STR2); an_join(STR1, STR3)',
'Cheat' => 'an_cheat(STR1, STR2); an_cheat(STR1, STR3)',
});
__END__
__C__
int an_cheat(const char *a, const char *b) {
unsigned char vec_a[26], vec_b[26];
const char *p, *end;
memset(vec_a, 0, sizeof(vec_a));
memset(vec_b, 0, sizeof(vec_b));
end = a+strlen(a);
for (p = a; p < end; ++p)
if (*p >= 'a' && *p <= 'z')
++vec_a[(*p)-'a'];
end = b+strlen(b);
for (p = b; p < end; ++p)
if (*p >= 'a' && *p <= 'z')
++vec_b[(*p)-'a'];
return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}
Of course, its cheating because its not written in Perl—its in C. Also, it has limitations the Perl version doesn't (only works with lowercase ASCII characters being the most significant—it just ignores everything else). But if you really need speed, you can use cheating like this.
edit:
Extending to all of Latin1 (well, raw 8-bit characters, really). Also, I found that the compiler managed to optimize a simpler loop (without point arithmetic) better, and its easier to read too, so... Benchmark tells me that the lowercase-ASCII-only version is about 10% faster:
int an_cheat_l1b(const char *a, const char *b) {
unsigned char vec_a[UCHAR_MAX], vec_b[UCHAR_MAX];
size_t len, i;
memset(vec_a, 0, sizeof(vec_a));
memset(vec_b, 0, sizeof(vec_b));
len = strlen(a);
for (i = 0; i < len; ++i)
++vec_a[((const unsigned char *)(a))[i]];
len = strlen(b);
for (i = 0; i < len; ++i)
++vec_b[((const unsigned char *)(b))[i]];
return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}
Note that the advantage of the C version grows as the string gets longer—which is expected, since its Θ(n) as opposed to the Perl versions O(n·logn). Also the penalty for full Latin1 decreases, meaning that penalty is probably the memcmp.
Upvotes: 7
Reputation: 118595
I thought using smart matches to compare the arrays without needing to recreate the string would have to chance to outperform the OP's method
sub anagram_smatch {
return [sort split//,$_[0]] ~~ [sort split//,$_[1]];
}
but the benchmarks don't bear that out.
Rate smatch join
smatch 1.73/s -- -54%
join 3.72/s 116% --
Upvotes: 2
Reputation: 183251
Is there a way to avoid having to convert between the scalar and array types (relying on
join
andsplit
)? And if so, which method is more efficient?
Since you ask these as two separate questions, I'll answer both.
Yes, there are ways to do this without creating an @
array or %
hash or whatnot, and I'll outline a few; but your way is more efficient than any of these.
One way is to treat a string as an array of characters by using the substr
function ($c = substr $s, 4, 1
sets $c
to the fifth element of $s
, and substr($s, 4, 1) = $c
sets the fifth element of $s
to $c
), and implement any typical sorting algorithm on it.
Alternatively, I'm pretty sure you could implement bubble-sort using just regex-substitutions with /e
.
Lastly, if you're willing to dispense with the sort-then-compare approach, you could write:
sub anagram
{
my ($s1, $s2) = @_;
while($s1 =~ m/\G(.)/s)
{ $s2 =~ s/\Q$1// or return ''; }
return $s2 eq '';
}
But, again, the split
-then-join
is more efficient than any of these.
Upvotes: 1
Reputation: 234795
If both strings are variable, I don't think you can do much better. One alternative is to build a hash that maps characters to their counts, and then compare that the hashes have the same keys and values. I believe that this is O(n) instead of O(n log n) for your approach, but it would probably have worse actual performance except on very long strings.
If you want to compare variable strings to a fixed reference string, then perhaps the hash-based approach might pay dividends earlier, since you only need to hash the reference once.
Upvotes: 1