Reputation: 23
I want to sort entries in a file in the following manner using Perl:
0filename
01filename
0111filename
0112filename
0113filename
02filename
021filename
022filename
1filename
11filename
110filename
1101filenam
2filename
21filename
211filename
212filename
I have used the sort function in Perl and the way I am getting the values now is:
0111filename
0112filename
0113filename
01filename
021filename
022filename
02filename
0filename
11filename
110filename
1101filename
1filename
211filename
212filename
21filename
2filename
I am able to conclude that somehow it is taking the ASCII value of the first character of the filename in cases involving 2filename and so on. Hence, if the case is 2file1 then it is comparing 2f with 21 and hence the wrong result!
Upvotes: 2
Views: 701
Reputation: 50647
Sort by using the Schwartzian transform:
use File::Slurp;
my @sorted =
map $_->[0],
sort { $a->[1] cmp $b->[1] } # String sort
# sort { $a->[1] <=> $b->[1] } # Numeric sort
map [ $_, /^(\d+)/ ],
read_file("file");
print @sorted;
Upvotes: 4
Reputation: 8449
The Perl sort function can be used with a user-supplied comparison function.
The first parameter should be a function or code block. The variables to compare are stored in parameters $a and $b.
For example, to sort as strings (as opposed to numbers):
sort {$a cmp $b} @list;
See more details in: sort perldoc
Upvotes: 2
Reputation: 107040
If you want to be able to sort your files by the numeric prefix, and then by the file name:
By default, Perl sorts arrays by ASCII alphabetic sequence1.
If that doesn’t meet your needs, Perl allows you to write your own comparison routine. Perl puts the two items being sorted in variables $a
and $b
. Your job is to figure out how to compare $a
and $b
. In your case, you want to split your string names apart into separate numeric and string parts, then compare each of those parts.
Sort uses two special operators: <=>
for numeric comparisons, and cmp
for string comparisons. Here's a quick example:
use strict;
use warnings;
use feature qw(say);
my @array = <DATA>;
chomp @array;
say join "\n", sort( sort_array @array ); # Parentheses are not necessary, but they make it easier to read
sub sort_array {
my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;
if ( $a_num != $b_num ) {
return $a_num <=> $b_num;
}
return $a_name cmp $b_name;
}
__DATA__
0111filename
0112filename
0113filename
01filename
021filename
022filename
02filename
0filename
11filename
110filename
1101filename
1filename
211filename
212filename
21filename
2filename
My subroutine is passed two values as $a
and $b
. I don't know exactly what these are, but they're two values from my array.
I then split $a
and $b
into numeric and string portions of my file name. I then compare the numeric portions. If the numeric portions don't match, whichever one is greater is sorted later. If the numeric portions are equal, I'll have to compare the string portion of that name using cmp
. What if these are also equal? It means $a
and $b
are equal, so I don't care -- just pass the result.
Depending upon your data, you may want a few more checks such as what if the file doesn't match the pattern /^(\d+)(*.)/
. (You probably just want a string sort). And what if there is no string part of the name? (That is, either $a_name
or $b_name
are undefined after the pattern match).
You can also embed your sort routine in the sort statement itself. Advantage: You have your sort statement right where you use it. Disadvantage: If it's really long, it can make your program harder to understand.
In other words, I could have done this:
say join ( "\n", sort {
my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;
if ( $a_num != $b_num ) {
return $a_num <=> $b_num;
}
return $a_name cmp $b_name;
} @array );
Embedding my sort subroutine right in my sort statement. You'll have to decide if this makes your program easier to understand.
Check the documentation of the sort function. It will give you some nice examples and help you understand how it works.
As pointed out by ikegami, the previous doesn't return the sort you wanted. I misunderstood what you're trying to sort. I thought you were sorting the numeric portion of your file name by numeric order and then the non-numeric order by file name.
However, it appears you're sorting the numeric prefix by string comparison order too. That is 02323filename
comes before 10filename
, because 02323
should be before 10
even though 2,323 is a bigger number.
No biggie. You simply change the numeric comparison operator in my previous example (<=>
) to a string comparison (cmp
):
say join ( "\n", sort {
my ( $a_num, $a_name ) = $a =~ /^(\d+)(.*)/;
my ( $b_num, $b_name ) = $b =~ /^(\d+)(.*)/;
if ( $a_num != $b_num ) {
return $a_num cmp $b_num;
}
return $a_name cmp $b_name;
} @array );
This now returns:
0filename
01filename
0111filename
0112filename
0113filename
02filename
021filename
022filename
1filename
11filename
110filename
1101filename
2filename
21filename
211filename
212filename
The previous sort (with a numeric comparison) returned:
0filename
01filename
1filename
02filename
2filename
11filename
021filename
21filename
022filename
110filename
0111filename
0112filename
0113filename
211filename
212filename
1101filename
The main point still holds: You can define your own sorting routine to sort the way you want.
By the way, the Schwartzian Transformation is a way of creating sort keys, sorting by those keys, and then stripping those keys. It can be more efficient.
Imagine a sorting routine like this:
sort { foo($a) <=> foo($b) } @array
It's nice and short, but for each comparison, you have to run the foo
subroutine twice. If it takes a while to compute foo
(let's say some value of money calculation), your sort could take a long time to run.
Instead, you go through your array, and generate sort keys for that array. A common way this happens is to create a hash. The sort key is stored in the hash, and the value is what you wanted to sort. You then sort the hash by key, and get your sort. In this case, foo
only has to be run once per sort key and not twice.
However, care must be taken. If you create duplicate sort keys, you could lose one of your values you're trying to sort in your hash. For example, I am looking at various mortgages I can get, and I want to sort them by the actual APR which I have to calculate. If two or more of those mortgages have the same APR, I will lose some of the mortgages in my sort. That's why you have to carefully follow the coding template outlined on the Wikipedia page.
In this case, for each comparison, I am splitting out the sort keys twice. It might make more sense to do a Schwartzian transformation, but if you don't have that many files, (say... less than 100,000 files to sort), you probably won't notice any difference unless you actually benchmark the routines.
In this case, it may simply be a matter of deciding whether the complexity of the sort transformation is worth the complexity and (obfuscation) in your code. If this was part of a server that served millions of requests per minute, it might be worth using a Schwartzian transformation2.
1. Not really true. Perl sorts in standard string comparison order which really depends upon your current collation locale. However, that depends upon Unicode, character encoding, your computer's locale setting, and a whole bunch of other stuff I purport to understand, but really don't, so I'll pretend it's just sorts in ASCII sequence if you do.
2 Then again, it might not. One of the worst things you can do is to optimize your code without knowing where you need to optimize. I've been on too many projects where the developers were spending days and weeks optimizing a section of code that was rarely utilized and never got around to improving the part of the code where the optimization might do the most good.
Upvotes: 1
Reputation: 385819
The leading zeroes are problematic. Everyone (including my earlier answer) resorted to a numerical comparison of the numeric part of the strings, but that considers 123 == 0123 which you don't.
The solution is to do a string comparison of the numeric part.
my @sorted = sort {
my ($a_num) = $a =~ /^(\d+)/;
my ($b_num) = $b =~ /^(\d+)/;
$a_num cmp $b_num
} @data;
If you have Perl 5.14, the following would be an itsy bit faster.
my @sorted = sort {
( $a =~ s/(?!\d)/\0/r ) cmp ( $b =~ s/(?!\d)/\0/r )
} @data;
You'll get more gains by switching to a Schwartzian Transform.
my @sorted =
map $_->[0],
sort { $a->[1] cmp $b->[1] }
map [ $_, /^(\d+)/ ],
@data;
The following in-place sort is even faster.
s/\0// for @data;
@data = sort @data;
s/(?!\d)/\0/ for @data;
The following is equally fast as a the in-place version, but it requires 5.14.
my @sorted =
map s/\0//r,
sort
map s/(?!\d)/\0/r,
@data;
Benchmark results:
Rate naive naive514 st grt514 inplace
naive 14775/s -- -12% -55% -70% -70%
naive514 16775/s 14% -- -49% -66% -66%
st 32630/s 121% 95% -- -33% -34%
grt514 48713/s 230% 190% 49% -- -1%
inplace 49211/s 233% 193% 51% 1% --
Benchmark code:
use strict;
use warnings;
use feature qw( say );
use Benchmark qw( cmpthese );
use List::Util qw( shuffle );
sub naive {
my @sorted = sort {
my ($a_num) = $a =~ /^(\d+)/;
my ($b_num) = $b =~ /^(\d+)/;
$a_num cmp $b_num
} @_;
1;
}
sub naive514 {
my @sorted = sort {
( $a =~ s/(?!\d)/\0/r ) cmp ( $b =~ s/(?!\d)/\0/r )
} @_;
1;
}
sub st {
my @sorted =
map $_->[0],
sort { $a->[1] cmp $b->[1] }
map [ $_, /^(\d+)/ ],
@_;
1;
}
sub grt514 {
my @sorted =
map s/\0//r,
sort
map s/(?!\d)/\0/r,
@_;
1;
}
sub inplace {
s/\0// for @_;
@_ = sort @_;
s/(?!\d)/\0/ for @_;
1;
}
my @data = shuffle qw(
0filename
01filename
0111filename
0112filename
0113filename
02filename
021filename
022filename
1filename
11filename
110filename
1101filename
2filename
21filename
211filename
212filename
);
cmpthese(-3, {
naive => sub { naive @data },
naive514 => sub { naive514 @data },
st => sub { st @data },
grt514 => sub { grt514 @data },
inplace => sub { inplace @data },
});
Upvotes: 3