user2940663
user2940663

Reputation: 23

Advanced sorting using perl

I want to sort entries in a file in the following manner using Perl:

0filename
01filename
0111filename
0112filename
0113filename
02filename
021filename
022filename
1filename
11filename
110filename
1101filenam
2filename
21filename
211filename
212filename

I have used the sort function in Perl and the way I am getting the values now is:

0111filename
0112filename
0113filename
01filename
021filename
022filename
02filename
0filename
11filename
110filename
1101filename
1filename
211filename
212filename
21filename
2filename

I am able to conclude that somehow it is taking the ASCII value of the first character of the filename in cases involving 2filename and so on. Hence, if the case is 2file1 then it is comparing 2f with 21 and hence the wrong result!

Upvotes: 2

Views: 701

Answers (4)

mpapec
mpapec

Reputation: 50647

Sort by using the Schwartzian transform:

use File::Slurp;

my @sorted =
  map $_->[0],
  sort { $a->[1] cmp $b->[1] }   # String sort
  # sort { $a->[1] <=> $b->[1] } # Numeric sort
  map [ $_, /^(\d+)/ ],
  read_file("file");

print @sorted;

Upvotes: 4

Ophir Yoktan
Ophir Yoktan

Reputation: 8449

The Perl sort function can be used with a user-supplied comparison function.

The first parameter should be a function or code block. The variables to compare are stored in parameters $a and $b.

For example, to sort as strings (as opposed to numbers):

sort {$a cmp $b} @list;

See more details in: sort perldoc

Upvotes: 2

David W.
David W.

Reputation: 107040

If you want to be able to sort your files by the numeric prefix, and then by the file name:

By default, Perl sorts arrays by ASCII alphabetic sequence1.

If that doesn’t meet your needs, Perl allows you to write your own comparison routine. Perl puts the two items being sorted in variables $a and $b. Your job is to figure out how to compare $a and $b. In your case, you want to split your string names apart into separate numeric and string parts, then compare each of those parts.

Sort uses two special operators: <=> for numeric comparisons, and cmp for string comparisons. Here's a quick example:

use strict;
use warnings;
use feature qw(say);

my @array = <DATA>;
chomp @array;

say join "\n", sort( sort_array @array ); # Parentheses are not necessary, but they make it easier to read

sub sort_array {
    my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
    my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;

    if ( $a_num != $b_num ) {
        return $a_num <=> $b_num;
    }
    return $a_name cmp $b_name;
}

__DATA__
0111filename
0112filename
0113filename
01filename
021filename
022filename
02filename
0filename
11filename
110filename
1101filename
1filename
211filename
212filename
21filename
2filename

My subroutine is passed two values as $a and $b. I don't know exactly what these are, but they're two values from my array.

I then split $a and $b into numeric and string portions of my file name. I then compare the numeric portions. If the numeric portions don't match, whichever one is greater is sorted later. If the numeric portions are equal, I'll have to compare the string portion of that name using cmp. What if these are also equal? It means $a and $b are equal, so I don't care -- just pass the result.

Depending upon your data, you may want a few more checks such as what if the file doesn't match the pattern /^(\d+)(*.)/. (You probably just want a string sort). And what if there is no string part of the name? (That is, either $a_name or $b_name are undefined after the pattern match).

You can also embed your sort routine in the sort statement itself. Advantage: You have your sort statement right where you use it. Disadvantage: If it's really long, it can make your program harder to understand.

In other words, I could have done this:

say join ( "\n", sort  {
    my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
    my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;

    if ( $a_num != $b_num ) {
        return $a_num <=> $b_num;
    }
    return $a_name cmp $b_name;
} @array );

Embedding my sort subroutine right in my sort statement. You'll have to decide if this makes your program easier to understand.

Check the documentation of the sort function. It will give you some nice examples and help you understand how it works.


OOPS: Not quite right...

As pointed out by ikegami, the previous doesn't return the sort you wanted. I misunderstood what you're trying to sort. I thought you were sorting the numeric portion of your file name by numeric order and then the non-numeric order by file name.

However, it appears you're sorting the numeric prefix by string comparison order too. That is 02323filename comes before 10filename, because 02323 should be before 10 even though 2,323 is a bigger number.

No biggie. You simply change the numeric comparison operator in my previous example (<=>) to a string comparison (cmp):

say join ( "\n", sort  {
    my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
    my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;

    if ( $a_num != $b_num ) {
        return $a_num cmp $b_num;
    }
    return $a_name cmp $b_name;
} @array );

This now returns:

0filename
01filename
0111filename
0112filename
0113filename
02filename
021filename
022filename
1filename
11filename
110filename
1101filename
2filename
21filename
211filename
212filename

The previous sort (with a numeric comparison) returned:

0filename
01filename
1filename
02filename
2filename
11filename
021filename
21filename
022filename
110filename
0111filename
0112filename
0113filename
211filename
212filename
1101filename

The main point still holds: You can define your own sorting routine to sort the way you want.

By the way, the Schwartzian Transformation is a way of creating sort keys, sorting by those keys, and then stripping those keys. It can be more efficient.

Imagine a sorting routine like this:

sort { foo($a) <=> foo($b) } @array

It's nice and short, but for each comparison, you have to run the foo subroutine twice. If it takes a while to compute foo (let's say some value of money calculation), your sort could take a long time to run.

Instead, you go through your array, and generate sort keys for that array. A common way this happens is to create a hash. The sort key is stored in the hash, and the value is what you wanted to sort. You then sort the hash by key, and get your sort. In this case, foo only has to be run once per sort key and not twice.

However, care must be taken. If you create duplicate sort keys, you could lose one of your values you're trying to sort in your hash. For example, I am looking at various mortgages I can get, and I want to sort them by the actual APR which I have to calculate. If two or more of those mortgages have the same APR, I will lose some of the mortgages in my sort. That's why you have to carefully follow the coding template outlined on the Wikipedia page.

In this case, for each comparison, I am splitting out the sort keys twice. It might make more sense to do a Schwartzian transformation, but if you don't have that many files, (say... less than 100,000 files to sort), you probably won't notice any difference unless you actually benchmark the routines.

In this case, it may simply be a matter of deciding whether the complexity of the sort transformation is worth the complexity and (obfuscation) in your code. If this was part of a server that served millions of requests per minute, it might be worth using a Schwartzian transformation2.


1. Not really true. Perl sorts in standard string comparison order which really depends upon your current collation locale. However, that depends upon Unicode, character encoding, your computer's locale setting, and a whole bunch of other stuff I purport to understand, but really don't, so I'll pretend it's just sorts in ASCII sequence if you do.

2 Then again, it might not. One of the worst things you can do is to optimize your code without knowing where you need to optimize. I've been on too many projects where the developers were spending days and weeks optimizing a section of code that was rarely utilized and never got around to improving the part of the code where the optimization might do the most good.

Upvotes: 1

ikegami
ikegami

Reputation: 385819

The leading zeroes are problematic. Everyone (including my earlier answer) resorted to a numerical comparison of the numeric part of the strings, but that considers 123 == 0123 which you don't.

The solution is to do a string comparison of the numeric part.

my @sorted = sort {
   my ($a_num) = $a =~ /^(\d+)/;
   my ($b_num) = $b =~ /^(\d+)/;
   $a_num cmp $b_num
} @data;

If you have Perl 5.14, the following would be an itsy bit faster.

my @sorted = sort {
   ( $a =~ s/(?!\d)/\0/r ) cmp ( $b =~ s/(?!\d)/\0/r )
} @data;

You'll get more gains by switching to a Schwartzian Transform.

my @sorted =
   map $_->[0],
   sort { $a->[1] cmp $b->[1] }
   map [ $_, /^(\d+)/ ],
   @data;

The following in-place sort is even faster.

s/\0// for @data;
@data = sort @data;
s/(?!\d)/\0/ for @data;

The following is equally fast as a the in-place version, but it requires 5.14.

my @sorted =
   map s/\0//r,
   sort
   map s/(?!\d)/\0/r,
   @data;

Benchmark results:

            Rate    naive naive514       st   grt514  inplace
naive    14775/s       --     -12%     -55%     -70%     -70%
naive514 16775/s      14%       --     -49%     -66%     -66%
st       32630/s     121%      95%       --     -33%     -34%
grt514   48713/s     230%     190%      49%       --      -1%
inplace  49211/s     233%     193%      51%       1%       --

Benchmark code:

use strict;
use warnings;
use feature qw( say );

use Benchmark  qw( cmpthese );
use List::Util qw( shuffle );

sub naive {
   my @sorted = sort {
      my ($a_num) = $a =~ /^(\d+)/;
      my ($b_num) = $b =~ /^(\d+)/;
      $a_num cmp $b_num
   } @_;
   1;
}

sub naive514 {
   my @sorted = sort {
      ( $a =~ s/(?!\d)/\0/r ) cmp ( $b =~ s/(?!\d)/\0/r )
   } @_;
   1;
}

sub st {
   my @sorted =
      map $_->[0],
      sort { $a->[1] cmp $b->[1] }
      map [ $_, /^(\d+)/ ],
      @_;
   1;
}

sub grt514 {
   my @sorted =
      map s/\0//r,
      sort
      map s/(?!\d)/\0/r,
      @_;
   1;
}

sub inplace {
   s/\0// for @_;
   @_ = sort @_;
   s/(?!\d)/\0/ for @_;
   1;
}

my @data = shuffle qw(
   0filename
   01filename
   0111filename
   0112filename
   0113filename
   02filename
   021filename
   022filename
   1filename
   11filename
   110filename
   1101filename
   2filename
   21filename
   211filename
   212filename
);

cmpthese(-3, {
   naive    => sub { naive    @data },
   naive514 => sub { naive514 @data },
   st       => sub { st       @data },
   grt514   => sub { grt514   @data },
   inplace  => sub { inplace  @data },
});

Upvotes: 3

Related Questions