Reputation: 77
I am trying to sort a text file where the lines are in the following format:
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
and want to sort numerically descending by the number at the end (i.e. 6 in this example). The lines do not have a predictable number of columns using space as a delimiter, but using ||| as a delimiter there are always 5 columns, and the final column always has 3 space-delimited numbers, the last of which is the one to sort by. The text file is around 15 GB. I did have a Perl script I wrote to do it, but it only worked on my old laptop, which had 32 GB of RAM, because it loads the whole file at once. Now I am stuck with 8 GB of RAM and it just churns the swap file for days. I have heard that the standard Linux sort command handles huge files more gracefully, but I can't find a way to make it use the number at the end.
Upvotes: 7
Views: 1965
Reputation: 67900
Since the problem is RAM, perhaps you can reduce the memory required by using Tie::File. It will allow you to refer to a line by its index in an array. You can get the numbers to sort by and use a Schwartzian transform to get a sorted list of indexes, and then simply reprint the file at the end.
use strict;
use warnings;
use Tie::File;
my $file = shift; # your filename argument
tie my @lines, 'Tie::File', $file or die $!;
my @list = map $_->[0], # restore line number
sort { $b->[1] <=> $a->[1] } # sort on captured number
map { [ $_, $lines[$_] =~ /(\d+)$/ ] } 0 .. $#lines;
# store an array ref [ ... ] containing line number and number to
# sort by
@lines = @lines[@list];
The last operation will save the file in the sorted order. Note that this is a permanent change, so make backups. It is also probably an expensive operation, and Tie::File has had some performance issues. Another way to do it, which is probably less expensive, is to simply iterate over the sorted list of line indexes and print line by line to a new file:
open my $fh, ">", "output.csv" or die $!;
for my $num (@list) {
print $fh $lines[$num], $/;
}
Printing directly to a file like this circumvents any shell caching required by redirecting output.
Upvotes: 0
Reputation: 4445
Assuming I'm allowed to ruin the original file (make a copy otherwise), you can use sort on the last column by rolling through the file once and turning the last column into a predictable column number. I'm using the @ symbol as something that I assume will not be in your data. Anything can be substituted if that's a bad assumption.
sed -i 's/ /@/g; s/@\([^@]*\)$/ \1/;' in.txt
# the file now looks like "!@!@|||@whatever@||| 6"
sort --buffer-size=1G -k2,2nr in.txt | sed 's/@/ /g' > sorted.txt
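If editing in.txt in place is not acceptable, the same substitution idea can be run as a plain pipeline that leaves the input untouched. The following is only a sketch under a few assumptions: the sample lines and the temporary directory are made up for illustration, @ does not occur in the data, and the sort is done in descending order as the question asks.

```shell
# Hypothetical sample data standing in for the real 15 GB file.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'a b ||| x ||| 0.5 ||| 0-0 ||| 10 2.5 6' \
  'c ||| y ||| 0.1 ||| 0-0 ||| 11 3.5 79' \
  'd e f ||| z ||| 0.2 ||| 0-0 ||| 12 4.5 19' > "$tmpdir/in.txt"

# 1. Replace every space with @, then turn the last @ back into a space,
#    so that the sort key is always field 2.
# 2. Sort numerically, descending, on field 2 only.
# 3. Restore the original spaces.
sed 's/ /@/g; s/@\([^@]*\)$/ \1/' "$tmpdir/in.txt" \
  | sort -k2,2nr \
  | sed 's/@/ /g' > "$tmpdir/sorted.txt"

cat "$tmpdir/sorted.txt"
```

For the real file, the --buffer-size option from the command above can be added to the sort step to bound its memory use.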
Upvotes: 0
Reputation: 3271
It seems that you want to order the file according to the last number, right?
So you can duplicate the last field at the start of the line with awk (using the default whitespace field separator, since the data contains no commas):
awk '{ print $NF, $0 }' prova
then sort the file numerically, in descending order, on that first field with
sort -k1,1nr
and finally remove the fake first field:
sed 's/^[0-9][0-9]* //'
Here is the script:
awk '{ print $NF, $0 }' prova | sort -k1,1nr | sed 's/^[0-9][0-9]* //'
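With a file too large to inspect by eye, it may help to check the result mechanically. One possible way, sketched here on a made-up three-line sample, is to run a decorate/sort/undecorate pipeline of this kind, then extract the last field of the output and let sort -c confirm it is in descending numeric order. The file names and the status variable are hypothetical.

```shell
# Hypothetical sample data; the real input would be the large file.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'a ||| 1 2 6' \
  'b c ||| 1 2 79' \
  'd ||| 1 2 19' > "$tmpdir/prova"

# Prefix each line with its last field, sort descending on that prefix,
# then strip the prefix again.
awk '{ print $NF, $0 }' "$tmpdir/prova" \
  | sort -k1,1nr \
  | sed 's/^[0-9][0-9]* //' > "$tmpdir/sorted"

# sort -c only checks the order: it exits with status 0 when the input
# is already sorted according to the given options.
if awk '{ print $NF }' "$tmpdir/sorted" | sort -nr -c; then
  status=sorted
else
  status=unsorted
fi
echo "$status"
```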
Upvotes: 1
Reputation: 290095
Maybe it is a bit tricky, but this mix of commands can do it:
awk '$1=$NF" "$1' file | sort -n | cut -d' ' -f2-
The main idea is that we print the file with the last value prepended to each line, then we sort, and finally we remove that value from the output.
awk '$1=$NF" "$1' file
As the parameter you want to sort by is the last one in the line, let's print it also in the first field.
sort -n
Then we pipe to sort -n, which sorts numerically in ascending order (use sort -nr for the descending order asked for in the question).
cut -d' ' -f2-
And we finally remove the value we temporarily used.
$ cat a
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
Showing each step:
$ awk '$1=$NF" "$1' a
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
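One caveat with assigning to $1 is that awk rebuilds the line with single spaces between fields, so runs of spaces or tabs in the original are collapsed. A possible variant, shown here only as a sketch on made-up sample data, prepends the key followed by a tab and leaves the rest of the line untouched; it also sorts in descending order, as the question asks. The file names are hypothetical.

```shell
# Hypothetical sample data; the first line contains a double space on purpose.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'a  b ||| 1 2 6' \
  'c ||| 1 2 79' \
  'd ||| 1 2 19' > "$tmpdir/a"

tab=$(printf '\t')

# Prepend "<last field><TAB>" without modifying the line itself,
# sort descending on the first tab-separated field,
# then drop that field again.
awk '{ print $NF "\t" $0 }' "$tmpdir/a" \
  | sort -t "$tab" -k1,1nr \
  | cut -f2- > "$tmpdir/sorted.txt"

cat "$tmpdir/sorted.txt"
```

Because cut -f2- keeps everything after the first tab, the original line comes back byte for byte, including the double space in the sample.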
Upvotes: 4