James Starlight

Reputation: 523

awk: pre-processing multi-column data

I am working on visualizing data provided in the following multi-column format:

| Run on Thu Oct 20 14:59:37 2022
|| GB non-polar solvation energies calculated with gbsa=2
idecomp = 1: Per-residue decomp adding 1-4 interactions to Internal.
Energy Decomposition Analysis (All units kcal/mol): Generalized Born solvent

DELTAS:
Total Energy Decomposition:
Residue |  Location |       Internal      |    van der Waals    |    Electrostatic    |   Polar Solvation   |    Non-Polar Solv.  |       TOTAL
-------------------------------------------------------------------------------------------------------------------------------------------------------
SER   1 | R SER   1 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.092 +/-  0.012 |    0.092 +/-  0.012 |    0.000 +/-  0.000 |    0.000 +/-  0.001
GLY   2 | R GLY   2 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |    0.001 +/-  0.001 |   -0.001 +/-  0.001 |    0.000 +/-  0.000 |    0.000 +/-  0.001
PHE   3 | R PHE   3 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.003 +/-  0.001 |    0.004 +/-  0.001 |    0.000 +/-  0.000 |    0.000 +/-  0.001
ARG   4 | R ARG   4 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.160 +/-  0.025 |    0.164 +/-  0.025 |    0.000 +/-  0.000 |    0.003 +/-  0.001
LYS   5 | R LYS   5 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.211 +/-  0.038 |    0.230 +/-  0.038 |    0.000 +/-  0.000 |    0.019 +/-  0.004
MET   6 | R MET   6 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.006 +/-  0.003 |    0.010 +/-  0.003 |    0.000 +/-  0.000 |    0.004 +/-  0.001
ALA   7 | R ALA   7 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.019 +/-  0.003 |    0.023 +/-  0.003 |    0.000 +/-  0.000 |    0.003 +/-  0.001

I need to reduce this multi-column data to a two-column format for subsequent plotting as a bar graph. In particular, I need to keep only the first and last columns, with some modifications to the initial data:

  1. In the labels from the first (Residue) column, all spaces should be replaced with "_", so SER 1 should become SER_1.

  2. In the last (TOTAL) column, the error information should be removed, so 0.000 +/- 0.001 should become 0.000, etc.

I've tried to use awk to select these two columns and also to filter on the second one. However, because the TOTAL field still contains the error term, the numeric condition could not be applied.

awk -F "|"  '
    BEGIN {
        print "@TYPE bar"
    }
    NR > 9 && $8 > 0.005 { print $1, $8 }
' $file > output_data.dat

which gives me something like this:

@TYPE bar
SER   1      0.000 +/-  0.001
GLY   2      0.000 +/-  0.001
PHE   3      0.000 +/-  0.001
ARG   4      0.003 +/-  0.001
LYS   5      0.019 +/-  0.004
MET   6      0.004 +/-  0.001
ALA   7      0.003 +/-  0.001
PHE   8      0.001 +/-  0.001
PRO   9      0.004 +/-  0.001
SER  10     -0.007 +/-  0.002
GLY  11      0.001 +/-  0.001
LYS  12      0.006 +/-  0.002
VAL  13     -0.010 +/-  0.009
GLU  14      0.002 +/-  0.005
GLY  15      0.004 +/-  0.002
CYS  16     -0.006 +/-  0.005
MET  17      0.023 +/-  0.010
VAL  18     -0.058 +/-  0.018

while the expected output would be:

@TYPE bar
SER_1      0.000
GLY_2      0.000
PHE_3      0.000
ARG_4      0.003
LYS_5      0.019
...
VAL_18     -0.058

How could I additionally modify the format of the data in the selected columns?

Upvotes: 0

Views: 44

Answers (1)

markp-fuso

Reputation: 34134

This needs a bit more parsing/manipulation of the input fields, e.g.:

awk -F'|' '
BEGIN { print "@TYPE bar" }
NF==8 { gsub(/^[[:space:]]+|[[:space:]]+$/,"",$1)        # strip leading/trailing spaces from 1st field
        gsub(/[[:space:]]+/,"_",$1)                      # convert all contiguous spaces to a single '_'

        gsub(/^[[:space:]]+|[[:space:]]+$/,"",$8)        # strip leading/trailing spaces from 8th field
        split($8,a,"[[:space:]]")                        # split 8th field on white space

        if (a[1]+0 == a[1] && a[1] > 0.005)              # if 1st sub-field is numeric and > 0.005 then ... 
           print $1,a[1]                                 # print to stdout
      }
' data

NOTES:

  • a[1]+0 == a[1] - shorthand test for whether a[1] is numeric (i.e., it skips the header line where a[1]=="TOTAL")
  • a[1] > 0.005 - taken from OP's current awk code, though this comparison does not appear to have been applied in OP's expected output; remove it if all lines should be displayed
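
To see the first note in isolation, here is a minimal sketch of the numeric test. It assumes a POSIX-style awk, where array elements produced by split() that look like numbers are compared numerically; is_numeric is a hypothetical helper name used only for this illustration.

```shell
# is_numeric: hypothetical helper; prints "numeric" or "text" for the
# first whitespace-separated token of its argument
is_numeric() {
    printf '%s\n' "$1" | awk '{
        split($0, a, " ")              # " " splits on runs of whitespace
        if (a[1]+0 == a[1]) print "numeric"
        else                print "text"
    }'
}

is_numeric "0.019 +/-  0.004"    # TOTAL field of a data row from the sample
is_numeric "TOTAL"               # TOTAL field of the header row
```

The first call should print numeric and the second text, which is why the header row never reaches the print. The outputs that follow come from the full script above, run on the sample input.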

This generates:

@TYPE bar
LYS_5 0.019

If we remove the comparison (&& a[1] > 0.005) this generates:

@TYPE bar
SER_1 0.000
GLY_2 0.000
PHE_3 0.000
ARG_4 0.003
LYS_5 0.019
MET_6 0.004
ALA_7 0.003
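
If the padded layout from the question's expected output matters, a variant along these lines may help. It is only a sketch, not a replacement for the script above: it removes the "+/- error" part with sub() instead of split(), checks for a number with an explicit regex (after sub() the value is an ordinary string, so the a[1]+0 == a[1] shortcut would likely not be reliable there), and pads the label with printf. The function name extract_totals and the column width of 10 are assumptions chosen for illustration.

```shell
# extract_totals: hypothetical wrapper; reads the decomposition table from
# the given file(s) or from stdin
extract_totals() {
    awk -F'|' '
    BEGIN { print "@TYPE bar" }
    NF == 8 {
        label = $1; total = $8
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", label)   # trim the residue label
        gsub(/[[:space:]]+/, "_", label)                 # "SER   1" -> "SER_1"
        sub(/[[:space:]]*\+\/-.*$/, "", total)           # drop the "+/- error" part
        gsub(/[[:space:]]/, "", total)                   # drop the remaining padding
        if (total ~ /^[-+]?[0-9]+\.?[0-9]*$/)            # skips the header ("TOTAL")
            printf "%-10s %s\n", label, total            # width 10 is an assumption
    }' "$@"
}

# Header and two data rows copied from the sample input
extract_totals <<'EOF'
Residue |  Location |       Internal      |    van der Waals    |    Electrostatic    |   Polar Solvation   |    Non-Polar Solv.  |       TOTAL
SER   1 | R SER   1 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.092 +/-  0.012 |    0.092 +/-  0.012 |    0.000 +/-  0.000 |    0.000 +/-  0.001
LYS   5 | R LYS   5 |    0.000 +/-  0.000 |   -0.000 +/-  0.000 |   -0.211 +/-  0.038 |    0.230 +/-  0.038 |    0.000 +/-  0.000 |    0.019 +/-  0.004
EOF
```

On the real file this would be called as extract_totals data > output_data.dat; a threshold such as total > 0.005 can be added to the if condition when only the larger contributions should be plotted.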

Upvotes: 1
