Sorting on multiple columns w/ an output file per key

Question

I'm uncertain as to how I can use the until loop inside a while loop.

I have an input file of 500,000 lines that look like this:

   9       1       1  0.6132E+02
   9       2       1  0.6314E+02
  10       3       1  0.5874E+02
  10       4       1  0.5266E+02
  10       5       1  0.5571E+02
   1       6       1  0.5004E+02
   1       7       1  0.5450E+02
   2       8       1  0.5696E+02
  11       9       1  0.6369E+02
  .....

What I'm hoping to achieve is to sort on the first column numerically so that I can pull all lines that start with the same number into new text files "cluster${i}.txt". From there I want to sort each "cluster${i}.txt" file numerically on its fourth column. After sorting, I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample "cluster1.txt" would look like this:

 1       6       1  0.5004E+02
 1       7       1  0.5450E+02
 1      11       1  0.6777E+02 
 ....

as well as an output.txt file that would look like this:

 1       6       1  0.5004E+02
 2     487       1  0.3495E+02
 3      34       1  0.0344E+02
 ....

Here is what I've written:

#!/bin/bash

input='input.txt'
i=1

sort -nk 1 $input > 'temp.txt'

while read line; do
   awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
     until [[$i -lt 20]]; do
     i=$((i+1))
   done
done

for f in *.txt; do
   sort -nk 4 > temp2.txt
   head -1 temp2.txt
   rm temp2.txt
done > output.txt

Charles Duffy · Accepted Answer

This only takes one line, if your sort -n knows how to handle exponential notation:

sort -nk 1,4 <input.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'

...or, to also write the first line for each index to output.txt:

sort -nk 1,4 <input.txt | awk '
  {
    if (NR == 1 || $1 != last) {
      print $0 >"output.txt"
      last=$1
    }
    of="cluster" $1 ".txt";
    print $0 >of
  }'
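
A hedged aside on the sort keys: a key written as 1,4 runs from column 1 through column 4 and is read as a single number under -n, so in practice only column 1 decides the order, and GNU sort's -n does not interpret the exponent. The following is a minimal sketch, assuming GNU sort (whose -g option parses values such as 0.5004E+02), that names each key separately. The sample rows and the sorted.txt name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: explicit per-column sort keys, assuming GNU sort.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample rows in the question's layout.
cat >input.txt <<'EOF'
   9       1       1  0.6132E+02
   1       6       1  0.5450E+02
   1       7       1  0.5004E+01
   9       2       1  0.6314E+01
EOF

# -k1,1n : compare column 1 only, as a plain number
# -k4,4g : break ties on column 4 only, as a general number (E notation)
LC_ALL=C sort -k1,1n -k4,4g input.txt >sorted.txt
cat sorted.txt
```

Within each key, the row with the smallest fourth column should come first even when the exponents differ.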

Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
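
If the awk at hand does not manage open files well, one possible workaround relies on the input already being sorted by key: each cluster file can be closed as soon as the key changes, so only one is open at a time. This is a sketch under that assumption, not a replacement for the commands above. The sample rows are hypothetical, and GNU sort is assumed for -g.

```shell
#!/bin/sh
# Sketch only: one open cluster file at a time, relying on sorted input.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample rows in the question's layout.
cat >input.txt <<'EOF'
   9       1       1  0.6132E+02
   1       6       1  0.5450E+02
   1       7       1  0.5004E+02
   9       2       1  0.5314E+02
EOF

LC_ALL=C sort -k1,1n -k4,4g input.txt | awk '
  NR == 1 || $1 != last {          # first row of a new key
    if (of != "") close(of)        # finished with the previous cluster file
    of = "cluster" $1 ".txt"
    last = $1
    print > "output.txt"           # first (smallest column 4) row per key
  }
  { print > of }                   # ">" truncates only when the file is opened
'
```

Because a key never reappears once the sorted stream has moved past it, closing its file early loses nothing.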

By the way, let's look at what was wrong with the original script:

It was slow. Really, really slow.

Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt. Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
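
Two of the points above, feeding the loop from a file and spacing the test brackets, can be seen in a small sketch. It uses the POSIX [ command, which follows the same whitespace rule as [[. The sample rows and the count variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: a read loop fed from a file, with a correctly spaced test.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical stand-in for temp.txt.
printf '%s\n' '   9   1   1  0.6132E+02' '   1   6   1  0.5004E+02' >temp.txt

count=0
while IFS= read -r line; do
  # "[" is a command name, so it must be separated from its arguments
  # by spaces, just like "[[".
  if [ -n "$line" ]; then
    count=$((count + 1))
  fi
done <temp.txt   # this redirection is what feeds read; without it, read uses stdin

echo "read $count lines"
```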

"Fixing" the inner loop to do what I imagine you meant to write might look like this:

# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
  i=0
  until [[ $i -ge 20 ]]; do
    awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
    i=$((i+1))
  done
done <temp.txt



...or, somewhat better (but still not as good as the solution suggested at the top):

# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
  awk -v var="$i" '$1 == var' <temp.txt | sort -nk 4 >"cluster${i}.txt"
  head -n 1 "cluster${i}.txt"
done >output.txt


Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.
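
The effect of where the redirection sits can be seen in a small sketch. The file names once.txt and each.txt are hypothetical. Both loops produce the same content, but the first opens its output once, while the second reopens it on every iteration.

```shell
#!/bin/sh
# Sketch only: redirecting a whole loop versus redirecting each command.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Output file opened once, for the whole loop.
for i in 1 2 3; do
  echo "line $i"
done >once.txt

# Output file reopened in append mode on every iteration.
for i in 1 2 3; do
  echo "line $i" >>each.txt
done

cmp once.txt each.txt && echo "same content"
```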
