user14748664
user14748664

Reputation:

AWK: simple math operations on multi-column data with subsequent data conversion

As a part of my Bash routine, I am dealing with the directory consisted of many folders, with the following naming patters:

1000_cne_lig1, 1000_cne_lig2, 1000_cne_lig3, 1000_cne_lig4, 1000_cne_lig5  ... 1000_cne_ligN
2000_cne_lig1, 2000_cne_lig2, 2000_cne_lig3, 2000_cne_lig4, 2000_cne_lig5  ... 2000_cne_ligN
3000_cne_lig1, 3000_cne_lig2, 3000_cne_lig3, 4000_cne_lig4, 5000_cne_lig5  ... 3000_cne_ligN
7000_cne_lig1, 7000_cne_lig2, 7000_cne_lig3, 7000_cne_lig4, 7000_cne_lig5  ... 4000_cne_ligN
...
xxxx_cne_lig1, xxxx_cne_lig2, xxxx_cne_lig3, xxxx_cne_lig4, xxxx_cne_lig5  ... xxxx_cne_ligN

Note, that all of the folders can be grouped into X categories (1000,2000 ... xxxx) according to the name of the system, which is defined by the pattern occured before the first slash "_" in the name of each folder. Inside each of the folder there is a CSV file contained data arranged in multi-line format:

ID, POP, dG
1, 40, -5.7600
2, 2, -5.4000
3, 8, -5.3300

I need to iteratively loop over the folders belonged to distinct system (e.g. for 5 folders of 1000, then for 5 folders of 2000 etc) and detect the CSV file. Then I need to carry out some simple math operations on each of the CSV log (for particular system): calculate the mean for the negative numbers (in third column of the CSV) and save it in a new file containing the name of the system (e.g. 1000.csv) in one line containing: the name of the particular folder, the mean value. For example for system 1000, the 1000.csv should be:

# system 1000; dG(mean)
lig1: -5.555
lig2: -6.003
lig3: -3.031
lig4: -3.222
lig5: -10.300
ligN: -NN.NNN

Note, I removed 1000_cne_ in the each of the line (name of the original file) but add it to the head of the CSV.

Finally for X systems, the script should produce X new CSV filles (1000.csv, 2000.csv, XXXX.csv etc) contained N lines according to the number of the folders.

Here is the practical realisation of the bash routine, which already classify the folders and then should be completed by AWK which will do all math and transfer of the computed mean values to new CSV:

#!/bin/bash
home=$PWD
# folder with the folders to analyse
storage="${home}"/results
# folder with the outputs
rescore="${home}"/rescore 
# pattern to recognize csv file for analysis
csv_pattern='*_filt3b.csv'



# this will iteratively do something on the group of the folders belonged to one syst
for folder in "${storage}"/*; do
# this is the name of each folder
folder_name=$(basename "$folder")
# detect the name of the system (X) determined by 4 characters near the first _ >> this is the name of output.csv
syst_name=$(basename "$folder" | cut -d'_' -f 1)
# detect the name of the sample (N) the last entry after the last _ >> the name of the lines in new CSV
sample_name=$(basename "$folder" | cut -d'_' -f 3)
pushd ${folder}
# apply AWK on each CSV to calculate MEAN and store it in new output.csv :
    awk 'FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s+=$3
   ++n
}
END {
   mean[suffix] = s/n
   print "# system", prefix, "; dG(mean)"
   for (i in mean)
      print i ":", mean[i]
}' "${folder}"/*filt3b.csv >> ${rescore}/${syst_name}.csv  
    popd
done

This gives me the following output.csv (for ten processed CSV filles of the 1000 system):

# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -6.44
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.59
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.96
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.17
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.73
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.04
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -6.625
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -2
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.34
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -8.14

Indicating that there is some mismatch between BASH and AWK parts: in the output filt3b.csv should be replaced by the part of the name (like lig1 etc) of the FOLDER contained this filt3b.csv. Also the path /Users/gleb/Desktop/scripts/analys should not be present but rather replaced by the name of the system (like 1000, occured always at the bottom line of the final output). Finally the numbers of mean values should be provided in the -X.XX format, like -4.59 (not -X or -X.XXX)

Updated: find a possibility to fix a problem with the header of output.csv, via intdoducing it by ECHO before AWK processing within the same script:

home=$PWD
# folder with the results
storage="${home}"/results
tmp="${home}"/tmp
rescore="${home}"/rescore

# csv for rescoring
csv_pattern='*_filt3b.csv'
    
if [ -d "${rescore}" ]; then
  rm -rf "${rescore}"
  mkdir "${rescore}"
  else
  mkdir "${rescore}"
fi


for folder in "${storage}"/*; do
# this is the name of each folder
folder_name=$(basename "$folder")
# detect the name of the system (X) determined by 4 characters near the first _ >> this is the name of output.csv
syst_name=$(basename "$folder" | cut -d'_' -f 1)
# detect the name of the sample (N) the last entry after the last _ >> the name of the lines in new CSV
sample_name=$(basename "$folder" | cut -d'_' -f 3)
pushd $folder
# a simple example of output format w/o calculations
if [ ! -f ${rescore}/${syst_name}.csv ]; then
# add the header to the CSV contained name of the system
echo "${syst_name}: dG(rescored)" > ${rescore}/${syst_name}.csv
fi
# apply AWK on each CSV to calculate mean for the numbers in third column 
awk 'FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s+=$3
   ++n
}
END {
   mean[suffix] = s/n
   #print "# system", prefix, "; dG(mean)"
   for (i in mean)
      print i ":", mean[i]
}' ${folder}/${csv_pattern} >> ${rescore}/${syst_name}.csv
popd
done

which gives 1000.csv for 10 folders like 1000_cne_ligN:

1000: dG(rescored)
filt3b.csv: -6.3825
filt3b.csv: -4.455
filt3b.csv: -5.28
filt3b.csv: -5.76
filt3b.csv: -5.52
filt3b.csv: -3.92
filt3b.csv: -7.505
filt3b.csv: -1.8
filt3b.csv: -5.79
filt3b.csv: -5.61

Upvotes: 1

Views: 364

Answers (1)

anubhava
anubhava

Reputation: 785058

You may use this single awk script for this which is POSIX compliant as well:

awk 'FNR==1 {if (n) mean[suffix] = s/n; prefix=suffix=FILENAME; sub(/_.*/, "", prefix); sub(/^.*_/, "", suffix); s=n=0} FNR > 1 {s+=$3; ++n} END {mean[suffix] = s/n; print "# system", prefix, "; dG(mean)"; for (i in mean) print i ":", mean[i]}' 1000_*

# system 1000 ; dG(mean)
lig1: -5.49667
lig2: -6.76333

Based on your edited question here is full shell script:

#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
cd "${home}"/results
# pattern to recognize csv file for analysis
csv_pattern='*_filt3b.csv'

while read -r d; do
awk '
FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s += $3
   ++n
}
END {
   mean[suffix] = s/n
   print "# system", prefix, "; dG(mean)"
   for (i in mean)
      printf "%s: %.2f\n", i, mean[i]
}' "${d}_"*/$csv_pattern | sort > "$rescore/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')

Upvotes: 3

Related Questions