shijie xu

Reputation: 2097

How to use awk and sed to compute statistics on data

I have a raw data file and I want to generate an output file; both are shown below. The rule for generation is: column 1 of the output is column 2 of the raw file, and column 2 of the output is the mean of the last-column values of all raw lines whose column 2 matches. For example, value1 in the output is (25+24.3846+13.8972+1.33333+1)/5.

 #Raw data
 4 compiler-compiler  100000 99975  1  25
  4 compiler-compiler  100000 99683  13  24.3846
  4 compiler-compiler  100000 93649  457  13.8972
  4 compiler-compiler  100000 99764  177  1.33333
  4 compiler-compiler  100000 99999  1  1
  4 compiler-sunflow  100000 99999  1  1
  4 compiler-sunflow  100000 99674  11  29.6364
  4 compiler-sunflow  100000 93467  423  15.4444
  4 compiler-sunflow  100000 99694  159  1.92453
  4 compiler-sunflow  100000 99938  4  15.5
  4 compress  100000 99997  1  3
  4 compress  100000 99653  10  34.7
  4 compress  100000 93639  454  14.011
  4 compress  100000 99666  173  1.93064
  4 compress  100000 99978  4  5.5
  4 serial  100000 99998  1  2
  4 serial  100000 99932  6  11.3333
  4 serial  100000 93068  460  15.0696
  4 serial  100000 99264  206  3.57282
  4 serial  100000 99997  3  1
  4 sunflow  100000 99998  1  2
  4 sunflow  100000 99546  18  25.2222
  4 sunflow  100000 93387  481  13.7484
  4 sunflow  100000 99752  189  1.31217
  4 sunflow  100000 99974  4  6.5
  4 xml-transfomer  100000 99994  1  6
  4 xml-transfomer  100000 99964  3  12
  4 xml-transfomer  100000 93621  463  13.7775
  4 xml-transfomer  100000 99540  199  2.31156
  4 xml-transfomer  100000 99986  2  7
  4 xml-validation  100000 99996  1  4
  4 xml-validation  100000 99563  16  27.3125
  4 xml-validation  100000 93748  451  13.8625
  4 xml-validation  100000 99716  190  1.49474
  4 xml-validation  100000 99979  3  7

#Output data
compiler-compiler   value1
....
xml-transfomer      value2
xml-validation      value3 

I think the common awk & sed tools can work for this, but I do not know how to do it.

Upvotes: 0

Views: 152

Answers (1)

hek2mgl

Reputation: 157992

sed cannot be used here since it does not support arithmetic. It's a job for awk:

awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt

Explanation:

NR>1 { c[$2]++; s[$2]+=$(NF) }

NR>1 means that the following block gets executed on all lines except the first line. $2 is the value of the second column. NF is the number of fields per line, so $(NF) contains the value of the last column. c and s are associative arrays: c counts the occurrences of each $2, and s stores a running total of the numeric value in the last column - grouped by $2.

END {for(i in c){print i,s[i]/c[i]}}

END means the following action will take place after the last line of input has been processed. The for loop iterates through c and outputs the name and the mean for all indexes in c.

Output:

xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
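To see the grouping in isolation, here is a self-contained sketch on a tiny made-up sample (the file name sample.txt and the three data lines are assumptions, not taken from the question):

```shell
# Hypothetical 3-line sample: a header line, two "a" rows, one "b" row.
printf 'header\n4 a 1\n4 a 3\n4 b 5\n' > sample.txt

# Same idea as above: count per group in c, sum the last column in s,
# then print mean = sum/count for each group after the last line.
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c)print i,s[i]/c[i]}' sample.txt
```

This prints `a 2` (mean of 1 and 3) and `b 5`, though the order of the two lines depends on the awk implementation's for-in iteration order.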

Note that you have no control over the output order when iterating over an associative array. If you care about the output order you might use the following command:

awk 'NR>1 && $2!=n && c {print n,t/c;c=t=0} NR>1{n=$2;c++;t+=$(NF)} END{if(c)print n,t/c}' input.txt

This command does not use associative arrays; it prints the stats for a group as soon as $2 changes. Note that this requires the input to be sorted by $2.
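If the input is not already sorted, one way to satisfy that requirement is to sort it first and then group in a single pass. A hedged sketch (input.txt and its single header line are assumptions based on the question; the END block flushes the final group):

```shell
# Drop the header line, sort by column 2, then stream-group:
# print mean = t/c whenever $2 changes, and once more at end of input.
tail -n +2 input.txt | sort -k2,2 \
    | awk '$2!=n && c {print n, t/c; c=t=0} {n=$2; c++; t+=$(NF)} END {if (c) print n, t/c}'
```

Because the data is pre-sorted, the output order is deterministic (alphabetical by group name).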

Upvotes: 2
