user2460953
user2460953

Reputation: 363

What does the default sorting in Perl sort by on an array?

I am reading some Perl code that sorts an array of hashes

uniq sort @{$ProjectData{$project}{Packages}}

I am not clear on whether this is sorting by the key of the hash, the key+value of the hash, or the memory address of the hash. Also, I'm not clear on whether uniq is stable or not. I am going to write my own compare function, but it seems or work as is, so I'd appreciate if someone could clear up what is happening currently.

Upvotes: 0

Views: 492

Answers (3)

ikegami
ikegami

Reputation: 385819

sort sorts the scalars passed to in on the stack[1]. In your case, the scalars placed on the stack are the elements of the array referenced by $ProjectData{$project}{Packages}.

The default compare function sorts the values in ascending lexical order.

 sort LIST

is short for

 sort { $a cmp $b } LIST

This means the elements to compare are stringified, and compared character-by-character. This sorts abc before def, and 123 before 13.

But you're comparing the stringification of references. The following are examples of stringification of references:

$ perl -E'say {}'
HASH(0x1a36ff8)

$ perl -E'say bless({}, "SomeClass");'
SomeClass=HASH(0x41d0a28)

As you can surmise, the only thing you can count on sort doing when given references is placing identical references together. It won't place references to identical hashes together, only references to the same hash.


uniq from List::MoreUtils is stable.

The order of elements in the returned list is the same as in LIST.

It also uses a lexical comparison of the of the scalars. This means it will only remove references to the same hash, not references to identical hashes.

$ perl -MData::Dumper=Dumper -MList::MoreUtils=uniq -e'
   my $x = { a => 234, b => 345 };
   my $y = $x;
   my $z = { a => 234, b => 345 };
   print(Dumper([ uniq $x, $y, $z ]));
'
$VAR1 = [
          {
            'b' => 345,
            'a' => 234
          },
          {
            'b' => 345,
            'a' => 234
          }
        ];

Tip: Unless you're using a method of removing duplicates that requires sorted input, it makes more sense to remove the duplicates before sorting the values.

sort +uniq @{$ProjectData{$project}{Packages}}

(The + prevents uniq from being used as sort's compare function.)

But since all that sort will meaningfully accomplish when given references is placing identical references together, and since uniq already eliminated identical references, the above can be simplified to

uniq @{$ProjectData{$project}{Packages}}

It's far more likely that you want to filter out references to duplicate hashes rather than references to the same hash. Here's how you can do that:

use JSON::XS qw( );

my $json_serializer = JSON::XS->new->canonical;

my %seen;
my @uniques = grep !$seen{ $json_serializer->encode($_) }++, @references;

For example,

$ perl -MData::Dumper=Dumper -MJSON::XS -e'
   my $json_serializer = JSON::XS->new->canonical;
   my $x = { a => 234, b => 345 };
   my $y = $x;
   my $z = { a => 234, b => 345 };
   my %seen;
   print(Dumper([ grep !$seen{ $json_serializer->encode($_) }++, $x, $y, $z ]));
'
$VAR1 = [
          {
            'b' => 345,
            'a' => 234
          }
        ];

  1. Ok, technically, it can actually be passed an array as an optimization, but that's internal details.

Upvotes: 0

Borodin
Borodin

Reputation: 126722

If $ProjectData{$project}{Packages} is a reference to an array of hash references, then Perl will stringify all those references to something like HASH(0xf1f6d290) and sort them as strings

Effectively it's sorting by memory address, but that's really not very useful and you may as well drop the sort, leaving you with

uniq @{ $ProjectData{$project}{Packages} }

If you're using List::Util::uniq then it's stable, but unless you sort the data by something a bit more useful it really doesn't matter

Upvotes: 1

Schwern
Schwern

Reputation: 164819

As for sort @{$ProjectData{$project}{Packages}}, that's just sorting the contents of the array @{$ProjectData{$project}{Packages}} using the default sorting.

According to the docs it uses "standard string comparison order" which means it uses cmp. It means it sorts by turning the arguments into strings and comparing them one character at a time in the order they appear in their character encoding. We used to call this "ASCIIbetical" but now with UTF-8 it's not necessarily ASCII.

Some examples.

# 1, 20, 3 because it considers them "1", "20", "3".
sort 1, 3, 20;

# a, z
sort "z", "a";

# Z, a because ASCII Z is 90 and ASCII a is 97.
sort "Z", "a";

# z, ä because UTF-8 z is 122 and UTF-8 ä is 195 164
use utf8;
sort "z", "ä";

Upvotes: 0

Related Questions