AngryPanda

Reputation: 1281

Custom Sort Multiple Files

I have 10 files (1 GB each). The contents of the files are as follows:

head -10 part-r-00000

a a a c b   1   
a a a dumbbell  1   
a a a f a   1   
a a a general i 2   
a a a glory 2   
a a a h d   1   
a a a h o   4   
a a a h z   1   
a a a hem hem   1   
a a a k 3   

I need to sort each file on the last column of each line (numeric, descending order); the lines are of variable length. If the numerical values tie, sort alphabetically by the second-to-last column. The following Bash command works on small datasets (not the complete files) and takes 3 seconds to sort only 10 lines from one file.

cat part-r-00000 | awk '{print $NF,$0}' | sort -nr | cut -f2- -d' ' > FILE

I want the output in a separate FILE. Can someone help me speed up the process?

Upvotes: 1

Views: 126

Answers (3)

gboffi

Reputation: 25083

You can use a Schwartzian transform (decorate, sort, undecorate) to accomplish your task:

awk '{print -$NF, $(NF-1), $0}' input_file | sort -n | cut -d' ' -f3-

  1. The awk command prepends the negative of the last field and the second-to-last field to each record.

  2. The sort -n command sorts the record stream in the required order because we used the negative of the last field. Records with equal numeric values fall back to a comparison of the whole line, which now starts with the second-to-last field.

  3. The cut command splits on spaces and removes the first two fields, i.e., the keys we prepended for sorting.
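To make the decoration step concrete, here is a minimal sketch that prints the intermediate stream `sort -n` actually sees. The function name `decorate` is hypothetical and not part of the answer; the sketch assumes every line has at least two whitespace-separated fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend the negated last field and the
# second-to-last field to each line (step 1 of the list above).
decorate() {
    awk '{print -$NF, $(NF-1), $0}' "$@"
}

# Usage: inspect the first few decorated records of a file, e.g.
#   decorate part-r-00000 | head
```

For the line `a a a glory 2`, for instance, the decorated record is `-2 glory a a a glory 2`, so the count and the tie-break word both sit at the front of the line.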

Example

$ echo 'a a a c b   1   
a a a dumbbell  1   
a a a f a   1   
a a a general i 2   
a a a glory 2   
a a a h d   1   
a a a h o   4   
a a a h z   1   
a a a hem hem   1   
a a a k 3' | awk '{print -$NF, $(NF-1), $0}' | sort -n | cut -d' ' -f3-
a a a h o   4   
a a a k 3
a a a glory 2   
a a a general i 2   
a a a f a   1   
a a a c b   1   
a a a h d   1   
a a a dumbbell  1   
a a a hem hem   1   
a a a h z   1   
$ 

Upvotes: 1

Cyrus

Reputation: 88776

Reverse the field order of each line, sort, then reverse the field order again:

awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}' file | sort -nr | awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}'

Output:

a a a h o 4 
a a a k 3 
a a a general i 2 
a a a glory 2 
a a a h z 1 
a a a hem hem 1 
a a a dumbbell 1 
a a a h d 1 
a a a c b 1 
a a a f a 1 

Upvotes: 1

Ed Morton

Reputation: 204164

No, once you get rid of the UUOC (useless use of cat), that's as fast as it's going to get. Obviously you need to add the second-to-last field as a sort key too, e.g. something like:

awk '{print $NF,$(NF-1),$0}' part-r-00000 | sort -k1,1nr -k2,2 | cut -f3- -d' '

Check the sort args, I always get mixed up with those.
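A minimal sketch of the same pipeline wrapped so it can take several files at once. It rests on a few assumptions that are not in the original: the function name `sort_by_last` is hypothetical, all parts are meant to end up in one combined sorted output, every line has at least two fields, and `LC_ALL=C` (byte-wise comparison, which is usually faster than locale-aware collation) gives an acceptable alphabetical order for the data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort lines by the last field (numeric, descending),
# breaking ties on the second-to-last field (ascending, byte order).
# Steps: awk prepends both keys, sort orders on them, cut removes them.
sort_by_last() {
    awk '{print $NF, $(NF-1), $0}' "$@" |
        LC_ALL=C sort -k1,1nr -k2,2 |
        cut -d' ' -f3-
}

# Usage (file names taken from the question):
#   sort_by_last part-r-* > FILE
```

Passing the files straight to awk avoids the extra cat process, and restricting each key with `-k1,1` and `-k2,2` keeps sort from comparing more of the line than it needs to.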

Upvotes: 2
