Reputation: 213
I have the following bash script, which pulls a list of numbers from a file. I want to keep a record of the order in which they were pulled (that is important information). With some help (possibly from an example I found on here), I ended up storing the values in an array, sorting it, and printing the result.
if [ ! -z "$sort" ]; then
    if [[ $sort == ascending ]]; then
        gawk '/SCF Done/ { c++; list[$5] = c }
        END {
            asorti(list, energies)
            for (i = 1; i <= c; i++)
                printf("%s%s%d\n", energies[i], " - Optimization Step #", list[energies[i]])
            print "Total Optimization Steps: " c
        }
        ' "$1"
The only issue is that the value in field $5 can repeat. When the array is built with list[$5], a repeated value is not a unique key, so the previously stored value of c gets overwritten. I've thought of a few things (multiplying the value of $5 by some random number, and then dividing that out afterwards), but I would not be surprised if there is an already established (and more efficient) method for dealing with this problem that I'm unaware of.
Here is the output of grep "SCF Done":
SCF Done: E(UM11L) = -1267.67892101 A.U. after 41 cycles
SCF Done: E(UM11L) = -1267.64771239 A.U. after 43 cycles
SCF Done: E(UM11L) = -1267.67892101 A.U. after 39 cycles
SCF Done: E(UM11L) = -1267.67892578 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892051 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892201 A.U. after 22 cycles
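The overwrite described above can be reproduced on those six lines. The following is only a minimal sketch: it keeps the array-building step from the script, drops the sorting, and compares the number of matched lines with the number of keys left in the array. The names sample, count_keys, and distinct are introduced here for illustration and are not part of the original script.

```shell
# Sample lines copied from the grep output above.
sample='SCF Done: E(UM11L) = -1267.67892101 A.U. after 41 cycles
SCF Done: E(UM11L) = -1267.64771239 A.U. after 43 cycles
SCF Done: E(UM11L) = -1267.67892101 A.U. after 39 cycles
SCF Done: E(UM11L) = -1267.67892578 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892051 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892201 A.U. after 22 cycles'

# count_keys (hypothetical helper): build list[$5] exactly as the script does,
# then report how many lines matched and how many distinct keys survived.
count_keys() {
    awk '/SCF Done/ { c++; list[$5] = c }
         END {
             for (k in list) distinct++
             print c " lines, " distinct " keys"
         }'
}

printf '%s\n' "$sample" | count_keys    # prints: 6 lines, 5 keys
```

Six lines go in but only five keys come out, because steps 1 and 3 share the same value in $5.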
The whole reason I switched to gawk was that I want to pull those middle numbers and also produce formatted output like the listing below. I originally used a simple grep "SCF Done" statement, but the formatting, sorting, and so on were becoming cumbersome to write. The requirement is unchanged: I want to sort by those numbers while keeping the correlation between each number and its optimization step (as shown below). The numbers don't always have to be unique.
-1267.67892101 - Optimization Step #1
-1267.64771239 - Optimization Step #2
-1267.67892101 - Optimization Step #3
-1267.67892578 - Optimization Step #4
-1267.67892051 - Optimization Step #5
-1267.67892201 - Optimization Step #6
Upvotes: 0
Views: 121
Reputation: 107040
Am I missing where the sort is coming into play? If you are worried about repeating lines, simply skip the line if it was the same as your previous line:
$ awk '
END { print "total steps: " count }
/SCF Done/ {
    if ( prev5 == $5 ) {
        next    # Skip duplicate line ("continue" is only valid inside a loop)
    }
    count++
    printf "%s - Optimization step #%d\n", $5, count
    prev5 = $5
}' file
If you really don't want a value to ever repeat, use an array with the value of $5 as the key. Then you can check the array to see whether you have already seen that value. All arrays in awk are really hashes:
$ awk '
END { print "total steps: " count }
{
    if ( $0 ~ /SCF Done/ ) {
        if ( prev[$5] == 1 ) {
            next    # Seen that value of $5 before. Skip
        }
        count++
        printf "%s - Optimization step #%d\n", $5, count
        prev[$5] = 1    # Mark that you've printed $5 out
    }
}' file
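The same "seen before" check is often written with the !seen[$5]++ idiom, which is true only the first time a given key appears. This is just a sketch of that spelling; dedupe_steps, seen, and the three-line sample are names and data chosen here for illustration (the sample lines are taken from the question).

```shell
# dedupe_steps (hypothetical helper): print each distinct value of $5 once.
# Reads the file(s) given as arguments, or stdin when there are none.
dedupe_steps() {
    awk '/SCF Done/ && !seen[$5]++ {
             printf "%s - Optimization step #%d\n", $5, ++count
         }
         END { print "total steps: " count }' "$@"
}

# First three lines from the question; the third repeats the first value.
sample3='SCF Done: E(UM11L) = -1267.67892101 A.U. after 41 cycles
SCF Done: E(UM11L) = -1267.64771239 A.U. after 43 cycles
SCF Done: E(UM11L) = -1267.67892101 A.U. after 39 cycles'

printf '%s\n' "$sample3" | dedupe_steps
```

Note that skipping a repeated value also means it is not counted, so the step numbers printed this way no longer line up with the original step order when duplicates exist.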
Upvotes: 0
Reputation: 246837
Why are you sorting with gawk instead of sort?
I don't quite get what you're trying to accomplish from your code snippet, but perhaps:
grep 'SCF Done' "$1" | awk '{print $5}' | cat -n | sort -k 2
I see. How about calling out to sort instead of using awk's array sorting?
awk '
    /SCF Done/ {
        printf "%s - Optimization step #%d\n", $5, ++n | "sort"
    }
    END {
        close("sort")
        print "total steps:", n
    }
' file
which would look like:
-1267.64771239 - Optimization step #2
-1267.67892051 - Optimization step #5
-1267.67892101 - Optimization step #1
-1267.67892101 - Optimization step #3
-1267.67892201 - Optimization step #6
-1267.67892578 - Optimization step #4
total steps: 6
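A plain sort compares the lines as text, which is why -1267.64771239 comes first in the listing above even though it is numerically the largest value. If numeric ascending order is what you want, one possible variation is to emit bare "value step" pairs, sort them numerically, and format afterwards. This is a sketch under that assumption; sort_steps and sample are names introduced here, and LC_ALL=C is set only so that the decimal point is interpreted consistently.

```shell
# sort_steps (hypothetical helper): numeric ascending sort on the energy,
# with the step number as tie-breaker so repeated energies are all kept.
# Reads the file(s) given as arguments, or stdin when there are none.
sort_steps() {
    awk '/SCF Done/ { print $5, ++n }' "$@" |
    LC_ALL=C sort -k1,1n -k2,2n |
    awk '{ printf "%s - Optimization step #%d\n", $1, $2 }
         END { print "total steps:", NR }'
}

# Sample lines copied from the question.
sample='SCF Done: E(UM11L) = -1267.67892101 A.U. after 41 cycles
SCF Done: E(UM11L) = -1267.64771239 A.U. after 43 cycles
SCF Done: E(UM11L) = -1267.67892101 A.U. after 39 cycles
SCF Done: E(UM11L) = -1267.67892578 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892051 A.U. after 24 cycles
SCF Done: E(UM11L) = -1267.67892201 A.U. after 22 cycles'

printf '%s\n' "$sample" | sort_steps
```

With the sample data this puts step #4 (the lowest energy) first and step #2 last, and steps #1 and #3 stay adjacent in their original order.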
Upvotes: 2