Reputation: 523
I am working on visualizing data provided in the following multi-column format:
| Run on Thu Oct 20 14:59:37 2022
|| GB non-polar solvation energies calculated with gbsa=2
idecomp = 1: Per-residue decomp adding 1-4 interactions to Internal.
Energy Decomposition Analysis (All units kcal/mol): Generalized Born solvent
DELTAS:
Total Energy Decomposition:
Residue | Location | Internal | van der Waals | Electrostatic | Polar Solvation | Non-Polar Solv. | TOTAL
-------------------------------------------------------------------------------------------------------------------------------------------------------
SER 1 | R SER 1 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.092 +/- 0.012 | 0.092 +/- 0.012 | 0.000 +/- 0.000 | 0.000 +/- 0.001
GLY 2 | R GLY 2 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | 0.001 +/- 0.001 | -0.001 +/- 0.001 | 0.000 +/- 0.000 | 0.000 +/- 0.001
PHE 3 | R PHE 3 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.003 +/- 0.001 | 0.004 +/- 0.001 | 0.000 +/- 0.000 | 0.000 +/- 0.001
ARG 4 | R ARG 4 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.160 +/- 0.025 | 0.164 +/- 0.025 | 0.000 +/- 0.000 | 0.003 +/- 0.001
LYS 5 | R LYS 5 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.211 +/- 0.038 | 0.230 +/- 0.038 | 0.000 +/- 0.000 | 0.019 +/- 0.004
MET 6 | R MET 6 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.006 +/- 0.003 | 0.010 +/- 0.003 | 0.000 +/- 0.000 | 0.004 +/- 0.001
ALA 7 | R ALA 7 | 0.000 +/- 0.000 | -0.000 +/- 0.000 | -0.019 +/- 0.003 | 0.023 +/- 0.003 | 0.000 +/- 0.000 | 0.003 +/- 0.001
I need to reduce this multi-column data to a two-column format for subsequent plotting as a bar graph. In particular, I need to take only the first and last columns, with some modifications of the initial data:
in the indexes from the first (Residue) column, all spaces should be replaced with "_", so SER 1 should become SER_1.
from the last (TOTAL) column, the error info should be removed, so 0.000 +/- 0.001 should become 0.000, etc.
I've tried to use AWK to select these two columns and additionally apply the condition on the second column. However, owing to the presence of the error info the conditions could not be applied.
awk -F "|" '
BEGIN {
print "@TYPE bar"
}
NR > 9 && $8 > 0.005 { print $1, $8 }
' $file > output_data.dat
which gives me something like this:
@TYPE bar
SER 1 0.000 +/- 0.001
GLY 2 0.000 +/- 0.001
PHE 3 0.000 +/- 0.001
ARG 4 0.003 +/- 0.001
LYS 5 0.019 +/- 0.004
MET 6 0.004 +/- 0.001
ALA 7 0.003 +/- 0.001
PHE 8 0.001 +/- 0.001
PRO 9 0.004 +/- 0.001
SER 10 -0.007 +/- 0.002
GLY 11 0.001 +/- 0.001
LYS 12 0.006 +/- 0.002
VAL 13 -0.010 +/- 0.009
GLU 14 0.002 +/- 0.005
GLY 15 0.004 +/- 0.002
CYS 16 -0.006 +/- 0.005
MET 17 0.023 +/- 0.010
VAL 18 -0.058 +/- 0.018
while the expected output would be:
@TYPE bar
SER_1 0.000
GLY_2 0.000
PHE_3 0.000
ARG_4 0.003
LYS_5 0.019
...
VAL_18 -0.058
How could I additionally modify the format of the data in the selected columns?
Upvotes: 0
Views: 44
Reputation: 34134
Need a bit more parsing/manipulation of the input fields, eg:
awk -F'|' '
BEGIN { print "@TYPE bar" }
NF==8 { gsub(/^[[:space:]]+|[[:space:]]+$/,"",$1)   # strip leading/trailing spaces from 1st field
        gsub(/[[:space:]]+/,"_",$1)                 # convert each run of contiguous spaces to a single "_"
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",$8)   # strip leading/trailing spaces from 8th field
        split($8,a,"[[:space:]]")                   # split 8th field on white space
        if (a[1]+0 == a[1] && a[1] > 0.005)         # if 1st sub-field is numeric and > 0.005 then ...
           print $1,a[1]                            # print to stdout
      }
' data
NOTES:
a[1]+0 == a[1] - shorthand method of testing if a[1] is numeric (ie, eliminates the header line where a[1]=="TOTAL")
a[1] > 0.005 - from OP's current awk code, though this comparison does not appear to have been applied in OP's expected output; remove this comparison if all lines should be displayed
This generates:
@TYPE bar
LYS_5 0.019
If we remove the comparison (&& a[1] > 0.005) this generates:
@TYPE bar
SER_1 0.000
GLY_2 0.000
PHE_3 0.000
ARG_4 0.003
LYS_5 0.019
MET_6 0.004
ALA_7 0.003
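To see the a[1]+0 == a[1] test from the NOTES in isolation, here is a minimal sketch. The three input lines are made up for illustration (they mimic what the 8th field looks like after trimming), and the variable name result is just a placeholder:

```shell
# Feed three hypothetical "8th field" values to awk: the header text and two data values.
result=$(printf '%s\n' 'TOTAL' '0.019 +/- 0.004' '0.003 +/- 0.001' |
awk '{
    split($0, a, "[[:space:]]")          # a[1] is the token before the first white space
    numeric = (a[1]+0 == a[1])           # true only when a[1] looks like a number
    print a[1], (numeric ? "numeric" : "not numeric"), (numeric && a[1] > 0.005 ? "kept" : "skipped")
}')
printf '%s\n' "$result"
```

The header token TOTAL fails the numeric test, so it is skipped before the threshold is ever checked; of the two data values only 0.019 passes a[1] > 0.005.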
Upvotes: 1