ardnew
ardnew

Reputation: 2086

Perl: Sort characters within a string

I have two strings, and I would like to test if they are anagrams of each other.

To test if string A is an anagram of string B, the characters of A and B are both sorted. If the resulting sorted strings match exactly, string A and string B are anagrams of each other.

I am spliting the strings up into character arrays, using Perl's sort routine, joining the characters back together, and testing for string equality with eq:

sub anagram
{
  my ($s1, $s2) = @_;

  return (join '', sort { $a cmp $b } split(//, $s1)) eq
         (join '', sort { $a cmp $b } split(//, $s2));
}

Is there a way to avoid having to convert between the scalar and array types (relying on join and split)? And if so, which method is more efficient?

Upvotes: 14

Views: 6334

Answers (4)

derobert
derobert

Reputation: 51137

Well, I've found a way that's over 30 times faster—though, arguably, its cheating. I've included the Benchmark.pm code to benchmark it, since you're apparently not familiar with it.

The benchmark is:

           Rate  Join Cheat
Join    83294/s    --  -97%
Cheat 2580687/s 2998%    --

And the code. After the third line, I think you'll understand why its arguably cheating:

use v5.14;
use Benchmark qw(cmpthese);
use Inline 'C';

sub an_join {
    my ($s1, $s2) = @_;
    return (join '', sort split(//, $s1)) eq
        (join '', sort split(//, $s2));
}

use constant {
    STR1 => 'abcdefghijklm',
    STR2 => 'abcdefghijkmm',
    STR3 => 'abcdefghijkml',
};

cmpthese(
    0,
    {
        'Join'  => 'an_join(STR1, STR2);  an_join(STR1, STR3)',
        'Cheat' => 'an_cheat(STR1, STR2); an_cheat(STR1, STR3)',
    });

__END__
__C__

int an_cheat(const char *a, const char *b) {
    unsigned char vec_a[26], vec_b[26];
    const char *p, *end;

    memset(vec_a, 0, sizeof(vec_a));
    memset(vec_b, 0, sizeof(vec_b));

    end = a+strlen(a);
    for (p = a; p < end; ++p)
        if (*p >= 'a' && *p <= 'z')
            ++vec_a[(*p)-'a'];
    end = b+strlen(b);
    for (p = b; p < end; ++p)
        if (*p >= 'a' && *p <= 'z')
            ++vec_b[(*p)-'a'];

    return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}

Of course, its cheating because its not written in Perl—its in C. Also, it has limitations the Perl version doesn't (only works with lowercase ASCII characters being the most significant—it just ignores everything else). But if you really need speed, you can use cheating like this.

edit:

Extending to all of Latin1 (well, raw 8-bit characters, really). Also, I found that the compiler managed to optimize a simpler loop (without point arithmetic) better, and its easier to read too, so... Benchmark tells me that the lowercase-ASCII-only version is about 10% faster:

int an_cheat_l1b(const char *a, const char *b) {
    unsigned char vec_a[UCHAR_MAX], vec_b[UCHAR_MAX];
    size_t len, i;

    memset(vec_a, 0, sizeof(vec_a));
    memset(vec_b, 0, sizeof(vec_b));

    len = strlen(a);
    for (i = 0; i < len; ++i)
        ++vec_a[((const unsigned char *)(a))[i]];
    len = strlen(b);
    for (i = 0; i < len; ++i)
        ++vec_b[((const unsigned char *)(b))[i]];

    return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}

Note that the advantage of the C version grows as the string gets longer—which is expected, since its Θ(n) as opposed to the Perl versions O(n·logn). Also the penalty for full Latin1 decreases, meaning that penalty is probably the memcmp.

Upvotes: 7

mob
mob

Reputation: 118595

I thought using smart matches to compare the arrays without needing to recreate the string would have to chance to outperform the OP's method

sub anagram_smatch {
    return [sort split//,$_[0]] ~~ [sort split//,$_[1]];
}

but the benchmarks don't bear that out.

         Rate smatch   join
smatch 1.73/s     --   -54%
join   3.72/s   116%     --

Upvotes: 2

ruakh
ruakh

Reputation: 183251

Is there a way to avoid having to convert between the scalar and array types (relying on join and split)? And if so, which method is more efficient?

Since you ask these as two separate questions, I'll answer both.

Yes, there are ways to do this without creating an @array or %hash or whatnot, and I'll outline a few; but your way is more efficient than any of these.

One way is to treat a string as an array of characters by using the substr function ($c = substr $s, 4, 1 sets $c to the fifth element of $s, and substr($s, 4, 1) = $c sets the fifth element of $s to $c), and implement any typical sorting algorithm on it.

Alternatively, I'm pretty sure you could implement bubble-sort using just regex-substitutions with /e.

Lastly, if you're willing to dispense with the sort-then-compare approach, you could write:

sub anagram
{
    my ($s1, $s2) = @_;
    while($s1 =~ m/\G(.)/s)
      { $s2 =~ s/\Q$1// or return ''; }
    return $s2 eq '';
}

But, again, the split-then-join is more efficient than any of these.

Upvotes: 1

Ted Hopp
Ted Hopp

Reputation: 234795

If both strings are variable, I don't think you can do much better. One alternative is to build a hash that maps characters to their counts, and then compare that the hashes have the same keys and values. I believe that this is O(n) instead of O(n log n) for your approach, but it would probably have worse actual performance except on very long strings.

If you want to compare variable strings to a fixed reference string, then perhaps the hash-based approach might pay dividends earlier, since you only need to hash the reference once.

Upvotes: 1

Related Questions