Sorting an array, but possible duplicates

Question

I have the following bash script, which pulls a list of numbers from a file. I want to keep a log of the order in which they were pulled, because that order is important information. With some help (possibly from an example I found on here), I ended up dumping the values into an array, sorting it, and printing the result.

if [ ! -z "$sort" ]; then
  if [[ $sort == ascending ]]; then
    gawk '/SCF Done/\
           {c++; list[$5]=c}
           END {
                 asorti(list,energies);
                 for (i=1;i<=c;i++)
                 printf("%s%s%d\n",energies[i]," - Optimization Step #",list[energies[i]])
                 print "Total Optimization Steps: "c}
           ' "$1"

The only issue is that the value in the $5 field can be repeated. When the array is built with list[$5], a repeated value is not a unique key, so the previous value of c gets overwritten. I've thought of a few workarounds (such as multiplying $5 by some random number and dividing it back out afterwards), but I would not be surprised if there is an established, more efficient method for this that I'm unaware of.
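The overwrite is easy to reproduce in isolation. The following is a minimal sketch, not the script above: it feeds awk two lines that share the same $5 value and prints what ends up in the array. The function name show_collision is made up for illustration, and the two input lines are copied from the sample output further down.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reproduces the key collision with two identical energies.
show_collision() {
  printf '%s\n' \
    ' SCF Done:  E(UM11L) =  -1267.67892101     A.U. after   41 cycles' \
    ' SCF Done:  E(UM11L) =  -1267.67892101     A.U. after   39 cycles' |
  awk '/SCF Done/ { c++; list[$5] = c }   # same key twice, so the second assignment wins
       END {
         for (e in list) print e, "-> step", list[e]
         print "lines seen:", c
       }'
}

show_collision
```

Two lines are read, but the array holds a single entry that points at step 2, so step 1 is lost before any sorting happens.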

Here is the output of grep "SCF Done":

 SCF Done:  E(UM11L) =  -1267.67892101     A.U. after   41 cycles
 SCF Done:  E(UM11L) =  -1267.64771239     A.U. after   43 cycles
 SCF Done:  E(UM11L) =  -1267.67892101     A.U. after   39 cycles
 SCF Done:  E(UM11L) =  -1267.67892578     A.U. after   24 cycles
 SCF Done:  E(UM11L) =  -1267.67892051     A.U. after   24 cycles
 SCF Done:  E(UM11L) =  -1267.67892201     A.U. after   22 cycles

The whole reason I switched to gawk was that I want to pull out those middle numbers and also produce formatted output like the following. I originally used a simple grep "SCF Done", but adding the formatting, the sorting, and so on was making the command rather cumbersome. The goal is unchanged: I want to sort by those numbers while keeping the association between each number and its optimization step (as shown below), even though the numbers are not always unique.

-1267.67892101 - Optimization Step #1
-1267.64771239 - Optimization Step #2
-1267.67892101 - Optimization Step #3
-1267.67892578 - Optimization Step #4
-1267.67892051 - Optimization Step #5
-1267.67892201 - Optimization Step #6

glenn jackman · Accepted Answer

Why are you sorting with gawk instead of sort?

I don't quite get what you're trying to accomplish from your code snippet, but perhaps:

grep 'SCF Done' "$1" | awk '{print $5}' | cat -n | sort -k 2

I see. How about calling out to sort instead of using awk's array sorting?

awk '
    /SCF Done/ {
        printf "%s - Optimization step #%d\n", $5, ++n | "sort"
    } 
    END {
        close("sort")
        print "total steps:", n
    }
' file

which would look like:

-1267.64771239 - Optimization step #2
-1267.67892051 - Optimization step #5
-1267.67892101 - Optimization step #1
-1267.67892101 - Optimization step #3
-1267.67892201 - Optimization step #6
-1267.67892578 - Optimization step #4
total steps: 6
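
Plain sort compares the lines as text, which is why -1267.64771239 (numerically the highest energy) comes first above. If a numeric ordering is wanted instead, one option is to pass -n to sort. The following is only a sketch under that assumption: the wrapper name sort_energies is hypothetical, -s (stable sort, available in GNU and BSD sort) is added so that duplicate energies stay in step order, and LC_ALL=C is set so the decimal point is parsed the same way in every locale.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk-plus-sort approach, sorting numerically.
# Usage: sort_energies logfile
sort_energies() {
  LC_ALL=C awk '
    /SCF Done/ {
      # -n orders by the leading number; -s keeps duplicates in input (step) order
      printf "%s - Optimization step #%d\n", $5, ++n | "sort -n -s"
    }
    END {
      close("sort -n -s")   # flush the sorted lines before printing the summary
      print "total steps:", n
    }
  ' "$1"
}
```

With the six sample lines from the question, this should list step #4 (the most negative energy) first and step #2 last, with the two equal energies from steps #1 and #3 next to each other.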
