StatsSorceress

Reputation: 3099

Bash: recursively find minimum value in a column in a file

I have a set of directories:

RUN1 RUN2 RUN3

Within each of those directories, I have files. RUN1 has:

mod1_1 mod1_2 mod1_3

and RUN2 has:

mod2_1 mod2_2 mod2_3

etc.

Each file has lines like this (this is mod1_1):

8.69e-01 2.59e-01 7.82e-01 4.92e-01
8.69e-01 2.56e-01 7.84e-01 4.95e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01
8.73e-01 2.53e-01 7.81e-01 4.99e-01

And this is mod1_2:

8.69e-01 2.59e-01 7.82e-01 4.98e-01
8.69e-01 2.56e-01 7.84e-01 4.90e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01
8.73e-01 2.53e-01 7.81e-01 4.99e-01

I want to create a new file that contains only the line with the smallest number in column 4 from each mod file. For example, suppose mod1_1 and mod1_2 are the only files. I want to create a new file that contains line 1 from mod1_1 and line 2 from mod1_2:

8.69e-01 2.59e-01 7.82e-01 4.92e-01  
8.69e-01 2.56e-01 7.84e-01 4.90e-01

I would like to do this for each RUN directory. I have tried this:

#/bin/bash

finddir=$(find -type d -name 'RUN*' | sort) #find the dirs
for i in $finddir; do
        cd $i
        echo $(pwd)
        findfiles=$(find -type f -name 'mod*' | sort -V) #find the files
        echo $findfiles
        for j in $findfiles; do
                s1=$(sort -k3,3 j)
                echo $s1
done

My problem is the sort command, and I don't know how to write the results to a file. Any ideas?

Pseudocode in case it's helpful:

For each directory RUN*
    For each file mod*
        get the minimum value in column 4, save the line that has that value
    End for 
    Write the lines that had the minimum values to a new file
End for

EDIT: Still having issues. Here's how I've modified it:

#/bin/bash

finddir=$(find -type d -name 'RUN*' | sort) #find the dirs
for i in $finddir; do
        cd $i
        echo $(pwd)
        findfiles=$(find -type f -name 'mod*' | sort -V) #find the files
        for j in $findfiles; do
                s1=$(sort -k 4 -g $j)
                echo -n "$s1"
        done
cd ..
done

I was 'cd'ing in the wrong part. This is a bit better - it gives me the four numbers on each line - but it's not returning only the line with the smallest value of column 4 from each file. Also, I still don't know how to export the final results to a new file.

Upvotes: 2

Views: 139

Answers (2)

iamauser

Reputation: 11469

For each of these files (1_1 or 1_2), the following command should give you the row with the lowest number in the 4th column of that file. Take 1_2 as an example:

~]$ cat 1_2
8.69e-01 2.59e-01 7.82e-01 4.98e-01
8.69e-01 2.56e-01 7.84e-01 4.90e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01
8.73e-01 2.53e-01 7.81e-01 4.99e-01

Now use sort -k 4 and keep only the first line:

~]$ sort -k 4 1_2 | head -1
8.69e-01 2.56e-01 7.84e-01 4.90e-01

Without head -1 you can see that the rows are sorted by the 4th column:

~]$ sort -k 4 1_2
8.69e-01 2.56e-01 7.84e-01 4.90e-01
8.69e-01 2.59e-01 7.82e-01 4.98e-01
8.73e-01 2.53e-01 7.81e-01 4.99e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01
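
If you would rather not sort the whole file just to read one line, here is a minimal sketch that uses awk instead of sort (a different technique from the one above). It tracks the smallest 4th-column value in a single pass. The function name min_col4 is only an illustrative name, and the sketch assumes whitespace-separated columns like the ones in the question.

```shell
# min_col4 FILE: print the first line of FILE whose 4th column is numerically smallest.
# "$4 + 0" forces a numeric comparison, so values like 4.95e-02 are read as numbers.
min_col4() {
  awk 'NR == 1 || $4 + 0 < min { min = $4 + 0; line = $0 }
       END { if (NR > 0) print line }' "$1"
}
```

Calling min_col4 1_2 should print the same row as the sort | head -1 pipeline for the sample data. On ties it keeps the first matching line.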

EDIT

#!/bin/bash
resultfile="somefile.txt"
for d in $(find . -type d -name 'RUN*');
do
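  # Run sort and head once per file, so that every mod file contributes its own minimum line
  find "$d" -type f -name 'mod*' -exec sh -c 'sort -k4 -g "$1" | head -n 1' _ {} \; >> "$resultfile"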
done
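
If you want one result file per RUN directory, as in the pseudocode in the question, something along these lines might work. It is only a sketch: the function name collect_minima and the output name minima.txt are placeholders I made up, it uses shell globs instead of find, and it assumes a sort that supports -g (GNU sort does). Note that globs expand in lexical order, so mod1_10 would come before mod1_2, unlike sort -V.

```shell
#!/bin/bash
# collect_minima ROOT: for each ROOT/RUN*/ directory, write the line with the
# smallest 4th column of every mod* file into ROOT/RUN*/minima.txt.
# (minima.txt is a hypothetical name, not something from the question.)
collect_minima() {
  local root dir file
  root=${1:-.}
  for dir in "$root"/RUN*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    : > "${dir}minima.txt"               # start from an empty result file
    for file in "$dir"mod*; do
      [ -f "$file" ] || continue
      sort -k4,4 -g -- "$file" | head -n 1 >> "${dir}minima.txt"
    done
  done
}
```

Running collect_minima . from the directory that contains RUN1, RUN2 and RUN3 should leave a minima.txt in each of them, with one line per mod file. No cd is needed because every path is built from the directory name.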

Upvotes: 1

andipla

Reputation: 383

There are a few problems:

1.) Use $j instead of j in the sort command.
2.) Quote your variables on echo (see "How do I preserve line breaks when storing a command output to a variable in bash?" for details).
3.) You cd into a directory but never go back, which you probably want to do before the next iteration.
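
Points 2 and 3 can be checked in isolation with a small sketch like the one below. The variable names are made up for the demonstration, and the subshell is just one possible way to avoid having to cd back.

```shell
#!/bin/bash
out=$(printf 'first line\nsecond line\n')   # two lines stored in one variable

unquoted=$(echo $out)    # word splitting turns the line break into a space
quoted=$(echo "$out")    # quoting keeps the line break

start=$PWD
( cd / && pwd )          # the cd only affects the subshell in parentheses
echo "still in: $PWD"    # the calling shell has not moved from $start
```

Wrapping the body of the directory loop in parentheses like this means there is nothing to undo at the end of each iteration.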

I tested a simpler version of your code (without changing into directories), and it works:

#!/bin/bash

findfiles=$(find -type f -name 'mod*' | sort -V) #find the files
for j in $findfiles; do
       echo $j
       s1=$(sort -k 4 -g $j)
       echo "$s1"
 done

Note that I used sort -g so that floating-point values in exponent notation are compared numerically. For example, if you change your data so that the second row has 4.95e-02 instead of 4.95e-01:

8.69e-01 2.59e-01 7.82e-01 4.92e-01
8.69e-01 2.56e-01 7.84e-01 4.95e-02
8.73e-01 2.53e-01 7.81e-01 4.99e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01

then without -g the order will be wrong:

$ cat test.dat | sort -k 4
8.69e-01 2.59e-01 7.82e-01 4.92e-01
8.69e-01 2.56e-01 7.84e-01 4.95e-02
8.73e-01 2.53e-01 7.81e-01 4.99e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01

With -g, the exponent is handled correctly:

$ cat test.dat | sort -k 4 -g
8.69e-01 2.56e-01 7.84e-01 4.95e-02
8.69e-01 2.59e-01 7.82e-01 4.92e-01
8.73e-01 2.53e-01 7.81e-01 4.99e-01
8.72e-01 2.54e-01 7.83e-01 5.00e-01
8.71e-01 2.53e-01 7.84e-01 5.01e-01

Upvotes: 1
