humblecomputeruser
humblecomputeruser

Reputation: 41

Finding the mean values of a field using awk?

This is what I am trying to do:

find the mean values for x,y,z for the HETATM records , the x value are the 7 field, the y values are the 8 field, and z values are the 9 field.

I am trying to do this using this file http://pastebin.com/EqA2SUMy

Here is the sample

  HETATM 1756  O   HOH A 501      -0.923  10.560 127.393  1.00 16.58           O  
 HETATM 1757  O   HOH A 502       9.272  22.148 134.167  1.00 15.08           O  
 HETATM 1758  O   HOH A 503       0.109  20.243 112.094  1.00 20.74           O  
 HETATM 1759  O   HOH A 504      -3.930  10.522 125.779  1.00 20.79           O  
 HETATM 1760  O   HOH A 505      -0.759  36.323  88.018  1.00 17.42           O  
 HETATM 1761  O   HOH A 506      -4.645  51.936  81.852  1.00 21.43           O  
 HETATM 1762  O   HOH A 507      -3.900  17.103 128.596  1.00 14.08           O  
 HETATM 1763  O   HOH A 508       6.834  21.053 135.062  1.00 16.98           O  

Can anyone show me how to do a script for this.

(this part is related to a comment viewers can ignore)

        ATOM    214  OE2 GLU A 460      -2.959  24.000 103.360  1.00 32.19           O  
     ATOM    215  N   ARG A 461      -5.878  28.748 106.473  1.00 22.68           N  
     ATOM    216  CA  ARG A 461      -6.553  30.043 106.524  1.00 24.34           C  
    ATOM    217  C   ARG A 461      -5.583  31.176 106.219  1.00 22.42           C  
    ATOM    218  O   ARG A 461      -5.918  32.121 105.497  1.00 25.07           O  
   ATOM    219  CB  ARG A 461      -7.222  30.272 107.887  1.00 24.53           C  
   ATOM    220  CG  ARG A 461      -8.425  29.394 108.150  1.00 26.38      

Upvotes: 0

Views: 107

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 753525

It's not rocket science. (Updated to catch only HETATM records — a trivial change; you can use more exacting regexes if you need to. However, it is also necessary to count which records match and use that count, not NR, since you're ignoring many records, in general.)

awk '/HETATM/ { sum7 += $7; sum8 += $8; sum9 += $9; count++ }
     END { if (count > 0)
               printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
                      sum7/count, sum8/count, sum9/count)
     }'

And yes, you could put it all on one line but it wouldn't be as readable.

I can't answer for why it produced zeros for you; when run on the data from the question, wobbly line starts and all, it produced the output:

avg(x) = 0.257250, avg(y) = 23.736000, avg(z) = 116.620125

If you think there is a possibility of empty input (or, at least, no HETATM records in the input) and an error message is not acceptable, then you can protect the printing action with if (count > 0) or equivalent (added to script). You can generate your own preferred output if ``count` is zero.

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203209

$ awk '{for (i=1;i<=3;i++) sum[i]+=$(i+6)}
    END{if (NR) for (i=1;i in sum;i++) print sum[i]/NR}' file
0.25725
23.736
116.62

The if (NR) is necessary to avoid a divide by zero error on an empty file.

If @jaypal is correct and you need to select just the input lines containing HETATM then change it to:

awk '/HETATM/{++nr; for (i=1;i<=3;i++) sum[i]+=$(i+6)}
    END{if (nr) for (i=1;i in sum;i++) print sum[i]/nr}' file

Upvotes: 1

Related Questions