awk: manipulations with multi-column data

Question

The following AWK script (being a part of the bash code) extracts numbers from selected columns of input.csv as well as do some simple stat operations of these numbers, eventually saving the results as 1 line in output.csv:

awk -F ", *" '                  # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      var = s2 / n - m * m      # variance
      if (var < 0) var = 0      # avoid an exception due to round-off error
      mean[suffix] = m          # store the mean in an array
      rmsd[suffix] = sqrt(var)
      lowest[suffix] = min      # lowest dG
      highest[suffix] = fourth  # dG in cluster with highest pop
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(//[^/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
   min = 0                      # lowest value of $3 (assuming all $3 < 0)
   max = 0                      # highest value of $2 (assuming all $2 > 0)
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if ($3 < min) min = $3       # update the lowest value
   if ($2 > max) {
      max = $2                  # update popMAX
      fourth = $3               # update the value of dG corresponded to topPOP
   }
}
END {
  if (n) {                     # just to avoid division by zero
   m = s / n
   var = s2 / n - m * m
   if (var < 0) var = 0
   mean[suffix] = m
   rmsd[suffix] = sqrt(var)
   lowest[suffix] = min     # most negative dG
   highest[suffix] = fourth  # dG in a cluster with pop(MAX)
  }
   print "Lig(CNE)", "dG(min)", "dG(popMAX)", "dG(mean)"
   for (i in mean)
      printf "%s %.2f %.2f %.2f
", i, lowest[i],  highest[i], mean[i]
}'  input.csv > output.csv

While operating with input.csv (shown below), it extracts numbers from the third column (dG) of the log: i) detecting the minimal value in the third column (dG(min) which always corresponds to the line with ID=1), as well as the number of the dG corresponded to the maximal number in the second column (POPmax):

# input.csv from the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # this is dG min to be extracted
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200 # this is dG corresponded to pop(MAX) to be extracted

finally it saves the results in another multi-column output.csv file, contained a part of the name of each processed CSV (with corresponded prefix used as the ID of the line), as well as information regarding its dG(min), dG(popMAX) as well as mean value calculated for all numbers in 3rd (dG) column:

# output.csv
Lig(CNE)    dG(min) dG(popMAX) dG(mean)
lig12       -5.65   -4.12     −5.055

So dG(min) is the number of $2 (dG) from the line with ID=1 of input.csv (lowest dG) and dG(popMAX) correspond to the value dG detected in the line, which has highest value in $2 (POP)

I need to modify the AWK part of my script adding two additional columns to the output.csv with the information regarding 2nd column of input.csv (POP) for each of the corresponded dG value (the has been taken from the 3rd column of the same log). So the same log should be like this

# output.csv
Lig(CNE).   dG(min) POP(min)    dG(popMAX) POP(max) dG(mean)
lig12       -5.65   (142)       -4.12      (150)    −5.055

In other worlds, additionally to the operations performed on the 3rd column, I need to consider the numbers from the 2nd column and then match both of them in the output.csv: so the POP(min) should be taken from the $2 of the first line (with dG(min) ) and POP(max) from $2 of the line with dG(popMAX).

I have tried to define a 2nd column infiormation using

'{print $2}'

but the resulted output.csv did not match the order of the lines of the original input.csv (for instance it took the 2nd column from the line not belonged to dG(min) etc)

tshiono · Accepted Answer

Would you please try:

awk -F ", *" '                  # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      var = s2 / n - m * m      # variance
      if (var < 0) var = 0      # avoid an exception due to round-off error
      mean[suffix] = m          # store the mean in an array
      rmsd[suffix] = sqrt(var)
      lowest[suffix] = min      # lowest dG
      highest[suffix] = fourth  # dG in cluster with highest pop
      pop_min[suffix] = popmin  # pop in cluster with lowest dG
      pop_max[suffix] = max     # highest pop
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(//[^/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
   min = 0                      # lowest value of $3 (assuming all $3 < 0)
   max = 0                      # highest value of $2 (assuming all $2 > 0)
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if ($3 < min) {
      min = $3                  # update the lowest value
      popmin = $2               # newly introduced variable
   }
   if ($2 > max) {
      max = $2                  # update popMAX
      fourth = $3               # update the value of dG corresponded to topPOP
   }
}
END {
   if (n) {                     # just to avoid division by zero
      m = s / n
      var = s2 / n - m * m
      if (var < 0) var = 0
      mean[suffix] = m
      rmsd[suffix] = sqrt(var)
      lowest[suffix] = min      # most negative dG
      highest[suffix] = fourth  # dG in a cluster with pop(MAX)
      pop_min[suffix] = popmin  # pop in cluster with lowest dG
      pop_max[suffix] = max     # highest pop
   }
      print "Lig(CNE)", "dG(min)", "POP(dGmin)", "dG(popMAX)", "POP(max)", "dG(mean)"
   for (i in mean)
      printf "%s %.2f (%d) %.2f (%d) %.2f
", i, lowest[i], pop_min[i], highest[i], pop_max[i], mean[i]
}' input.csv

The highest pop associated with dG(popMAX) is already assigned to the existing variable max.
The variable popmin has been introduced to hold the pop value associated with the lowest dG. The variable is updated whenever min (the lowest dG) is updated.

BTW the variable names such as min or max are getting less self-explanatory due to the repeated extensions. It may be better to rename them and/or refactor the code for future maintainability.

awk: manipulations with multi-column data

Answers (1)

Related Questions