Reputation:
The following AWK script (being a part of the bash code) extracts numbers from selected columns of input.csv as well as do some simple stat operations of these numbers, eventually saving the results as 1 line in output.csv:
awk -F ", *" ' # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
if (n) { # calculate the results of previous file
m = s / n # mean
var = s2 / n - m * m # variance
if (var < 0) var = 0 # avoid an exception due to round-off error
mean[suffix] = m # store the mean in an array
rmsd[suffix] = sqrt(var)
lowest[suffix] = min # lowest dG
highest[suffix] = fourth # dG in cluster with highest pop
}
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix)
sub(/^.*_/, "", suffix)
s = 0 # sum of $3
s2 = 0 # sum of $3 ** 2
n = 0 # count of samples
min = 0 # lowest value of $3 (assuming all $3 < 0)
max = 0 # highest value of $2 (assuming all $2 > 0)
}
FNR > 1 {
s += $3
s2 += $3 * $3
++n
if ($3 < min) min = $3 # update the lowest value
if ($2 > max) {
max = $2 # update popMAX
fourth = $3 # update the value of dG corresponded to topPOP
}
}
END {
if (n) { # just to avoid division by zero
m = s / n
var = s2 / n - m * m
if (var < 0) var = 0
mean[suffix] = m
rmsd[suffix] = sqrt(var)
lowest[suffix] = min # most negative dG
highest[suffix] = fourth # dG in a cluster with pop(MAX)
}
print "Lig(CNE)", "dG(min)", "dG(popMAX)", "dG(mean)"
for (i in mean)
printf "%s %.2f %.2f %.2f\n", i, lowest[i], highest[i], mean[i]
}' input.csv > output.csv
While operating with input.csv (shown below), it extracts numbers from the third column (dG) of the log: i) detecting the minimal value in the third column (dG(min) which always corresponds to the line with ID=1), as well as the number of the dG corresponded to the maximal number in the second column (POPmax):
# input.csv from the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # this is dG min to be extracted
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200 # this is dG corresponded to pop(MAX) to be extracted
finally it saves the results in another multi-column output.csv file, contained a part of the name of each processed CSV (with corresponded prefix used as the ID of the line), as well as information regarding its dG(min), dG(popMAX) as well as mean value calculated for all numbers in 3rd (dG) column:
# output.csv
Lig(CNE) dG(min) dG(popMAX) dG(mean)
lig12 -5.65 -4.12 −5.055
So dG(min) is the number of $2 (dG) from the line with ID=1 of input.csv (lowest dG) and dG(popMAX) correspond to the value dG detected in the line, which has highest value in $2 (POP)
I need to modify the AWK part of my script adding two additional columns to the output.csv with the information regarding 2nd column of input.csv (POP) for each of the corresponded dG value (the has been taken from the 3rd column of the same log). So the same log should be like this
# output.csv
Lig(CNE). dG(min) POP(min) dG(popMAX) POP(max) dG(mean)
lig12 -5.65 (142) -4.12 (150) −5.055
In other worlds, additionally to the operations performed on the 3rd column, I need to consider the numbers from the 2nd column and then match both of them in the output.csv: so the POP(min) should be taken from the $2 of the first line (with dG(min) ) and POP(max) from $2 of the line with dG(popMAX).
I have tried to define a 2nd column infiormation using
'{print $2}'
but the resulted output.csv did not match the order of the lines of the original input.csv (for instance it took the 2nd column from the line not belonged to dG(min) etc)
Upvotes: 0
Views: 238
Reputation: 22012
Would you please try:
awk -F ", *" ' # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
if (n) { # calculate the results of previous file
m = s / n # mean
var = s2 / n - m * m # variance
if (var < 0) var = 0 # avoid an exception due to round-off error
mean[suffix] = m # store the mean in an array
rmsd[suffix] = sqrt(var)
lowest[suffix] = min # lowest dG
highest[suffix] = fourth # dG in cluster with highest pop
pop_min[suffix] = popmin # pop in cluster with lowest dG
pop_max[suffix] = max # highest pop
}
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix)
sub(/^.*_/, "", suffix)
s = 0 # sum of $3
s2 = 0 # sum of $3 ** 2
n = 0 # count of samples
min = 0 # lowest value of $3 (assuming all $3 < 0)
max = 0 # highest value of $2 (assuming all $2 > 0)
}
FNR > 1 {
s += $3
s2 += $3 * $3
++n
if ($3 < min) {
min = $3 # update the lowest value
popmin = $2 # newly introduced variable
}
if ($2 > max) {
max = $2 # update popMAX
fourth = $3 # update the value of dG corresponded to topPOP
}
}
END {
if (n) { # just to avoid division by zero
m = s / n
var = s2 / n - m * m
if (var < 0) var = 0
mean[suffix] = m
rmsd[suffix] = sqrt(var)
lowest[suffix] = min # most negative dG
highest[suffix] = fourth # dG in a cluster with pop(MAX)
pop_min[suffix] = popmin # pop in cluster with lowest dG
pop_max[suffix] = max # highest pop
}
print "Lig(CNE)", "dG(min)", "POP(dGmin)", "dG(popMAX)", "POP(max)", "dG(mean)"
for (i in mean)
printf "%s %.2f (%d) %.2f (%d) %.2f\n", i, lowest[i], pop_min[i], highest[i], pop_max[i], mean[i]
}' input.csv
dG(popMAX)
is already assigned to
the existing variable max
.popmin
has been introduced to hold the pop value
associated with the lowest dG. The variable is updated whenever min
(the lowest dG) is updated.BTW the variable names such as min
or max
are getting less
self-explanatory due to the repeated extensions. It may be better to
rename them and/or refactor the code for future maintainability.
Upvotes: 1