Hot JAMS

Reputation: 193

awk: compute mean values for the data in distinct files

I am using bash + awk to extract some information from log files located in a directory and save the summary in a separate file. At the bottom of each log file there is a table like:

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -6.961          0          0
   2       -6.797      2.908      4.673
   3       -6.639      27.93      30.19
   4       -6.204      2.949      6.422
   5       -6.111      24.92      28.55
   6       -6.058      2.836      7.608
   7       -5.986      6.448      10.53
   8        -5.95      19.32      23.99
   9       -5.927      27.63      30.04
  10       -5.916      27.17      31.29
  11       -5.895      25.88      30.23
  12       -5.835      26.24      30.36

From this I need to focus on the (negative) values in the second column. Specifically, I need to take the first 10 values from the second column (from -6.961 to -5.916), compute their mean, and save the mean value together with the name of the log as one line in a new ranking.log, so for 5 processed logs it should look something like this:

# ranking_${output}.log
log_name1 -X.XXX
log_name2 -X.XXX
log_name3 -X.XXX
log_name4 -X.XXX
log_name5 -X.XXX

where -X.XXX is the mean value computed for each log over the first 10 positions.
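For the example table above, that mean would be (-6.961 - 6.797 - 6.639 - 6.204 - 6.111 - 6.058 - 5.986 - 5.95 - 5.927 - 5.916) / 10 = -6.2549.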

Here is my awk code, integrated into a bash function, which extracts the first value (-6.961 in the example table) from each log, without computing the mean:

 # take only the first line (lowest dG) from each log
take_the_first_value () {
    awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*.log  > "${results}"/ranking.csv
} 

How can I modify the AWK part to compute the MEAN values instead of always taking the value located in the first line of the table?

Upvotes: 0

Views: 62

Answers (3)

Ed Morton

Reputation: 204731

With GNU awk for ENDFILE:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    ($2+0) < 0 {
        sum += $2
        if ( ++cnt == 10 ) {
            nextfile
        }
    }
    ENDFILE {
        print FILENAME, (cnt ? sum/cnt : 0)
        cnt = sum = 0
    }
' "${@:--}"

$ ./tst.sh file
file -6.2549

Note that the above will work even if your input files have fewer than 10 lines at the end, including empty files.
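If you also want the directory and the .log suffix stripped from the printed name, as the take_the_first_value function in the question does, a minimal variation of the same GNU awk approach could look like this (the name cleanup is an assumption about the desired output format):

awk '
    ($2+0) < 0 {
        sum += $2
        if ( ++cnt == 10 ) {
            nextfile
        }
    }
    ENDFILE {
        name = FILENAME
        sub(/.*\//, "", name)      # drop the directory part
        sub(/\.log$/, "", name)    # drop the .log extension (assumed naming)
        print name, (cnt ? sum/cnt : 0)
        cnt = sum = 0
    }
' "${results}"/*.log > "${results}"/ranking.csv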

Upvotes: 2

Cyrus

Reputation: 88999

I suggest with GNU awk:

awk -v num=10 'BEGINFILE{ c=sum=0 }
     $1~/^[0-9]+$/ && NF==4{
       c++; sum=sum+$2;
       if(c==num){
         sub(/.*\//, "", FILENAME);
         print FILENAME, sum/num
       }
     }' "${results}"/*.log  >> "${results}"/ranking.csv

I used $1~/^[0-9]+$/ && NF==4 to identify the correct lines.
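If the names written to ranking.csv should also lose the .log suffix to match the log_name1 style from the question (an assumption about the desired naming), one extra sub() before the print handles it:

       if(c==num){
         sub(/.*\//, "", FILENAME);
         sub(/\.log$/, "", FILENAME);   # also drop the .log suffix (assumed naming)
         print FILENAME, sum/num
       }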

Upvotes: 1

Andre Wildberg

Reputation: 19271

This gives you the averages. The pattern used to find the first value is the separator line ^---+--- followed by a [:digit:] in the first field of the next line. For each log file, do

$ awk '$1~/[[:digit:]]/ && set==1{ x+=$2; i++;
    gsub(/\/*.*\//,"", FILENAME);               
    if(i==10){ set=0; print FILENAME, x/i; i=0; x=0 } } 
    /^\-+\+\-+/{ set=1 }' "${results}"/*.log > "${results}"/ranking.csv
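As a quick check, with the example table from the question saved as a single log (log_name1.log is just a hypothetical file name), the resulting line matches the average from the other answers; note that the gsub only strips the directory, so the .log suffix stays in the name:

$ cat "${results}"/ranking.csv
log_name1.log -6.2549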

Upvotes: 1
