Reputation: 2097
I have a raw data file and i want to generate an output file, both of which are shown below. The rule for generation is that column 1 in the output is equivalent to the column 2 of raw file. The column 2 in the output is the mean of last column values in the raw data, the line of which are match. For example, the value1 in the output is (25+24.3846+13.8972+1.33333+1)/5.
#Raw data
4 compiler-compiler 100000 99975 1 25
4 compiler-compiler 100000 99683 13 24.3846
4 compiler-compiler 100000 93649 457 13.8972
4 compiler-compiler 100000 99764 177 1.33333
4 compiler-compiler 100000 99999 1 1
4 compiler-sunflow 100000 99999 1 1
4 compiler-sunflow 100000 99674 11 29.6364
4 compiler-sunflow 100000 93467 423 15.4444
4 compiler-sunflow 100000 99694 159 1.92453
4 compiler-sunflow 100000 99938 4 15.5
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
4 sunflow 100000 99998 1 2
4 sunflow 100000 99546 18 25.2222
4 sunflow 100000 93387 481 13.7484
4 sunflow 100000 99752 189 1.31217
4 sunflow 100000 99974 4 6.5
4 xml-transfomer 100000 99994 1 6
4 xml-transfomer 100000 99964 3 12
4 xml-transfomer 100000 93621 463 13.7775
4 xml-transfomer 100000 99540 199 2.31156
4 xml-transfomer 100000 99986 2 7
4 xml-validation 100000 99996 1 4
4 xml-validation 100000 99563 16 27.3125
4 xml-validation 100000 93748 451 13.8625
4 xml-validation 100000 99716 190 1.49474
4 xml-validation 100000 99979 3 7
#Output data
compiler-compiler value1
....
xml-transfomer value2
xml-validation value3
I think the comment awk & sed can work for this, but i do not know how get it.
Upvotes: 0
Views: 152
Reputation: 157992
sed
cannot being used here since it does not support math operations. It's a job for awk
:
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt
Explanation:
NR>1 { c[$2]++; s[$2+=($NF) }
NR>1
means that the following block gets executed on all lines except of the first line. $2
is the value of the second column. NF
is the number of fields per line. $(NF)
contains the value of the last column. c
and s
are assoc arrays. c
counts the occurrences of $2
, c
stores a total of the numeric value in the last column - grouped by $2
.
END {for(i in c){print i,s[i]/c[i]}}
END
means the following action will take place after the last line of input has been processed. The for
loop iterates through c
and outputs the name and the mean for all indexes in c
.
Output:
xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
Note that you have influence on the output order if using an assoc aray. If you care about the output order you might use the following command:
awk 'NR>1 && $2!=n && c {print n,t/c;c=t=0} NR>1{n=$2;c++;t+=$(NF)}'
This command does not use assoc arrays, it prints out the stats just in time when $2 changes - note that this requires to sorted by $2
Upvotes: 2