Reputation: 25
I have a file like this:
Score 1 24 HG 1
Score 2 26 HG 2
Score 5 56 RP 0.5
Score 7 82 RP 1
Score 12 97 GM 5
Score 32 104 LS 3
I would like to average column 5 if column 4 are identical and print the average as column 6 so that it looks like this:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
I have tried a couple of solutions I found on here. e.g.
awk '{ total[$4] += $5; ++n[$4] } END { for(i in total) print i, total[i] / n[i] }'
but they all end up with this:
HG 1.5
RP 0.75
GM 5
LS 3
Which is undesirable as I lose a lot of information.
Upvotes: 1
Views: 176
Reputation: 18697
You can iterate through your table twice: calculate the averages (as you already) do on the first iteration, and then print them out on the second iteration:
awk 'NR==FNR { total[$4] += $5; ++n[$4] } NR>FNR { print $0, total[$4] / n[$4] }' file file
Notice the file
twice at the end. While going through the "first" file, NR==FNR
, and we sum the appropriate values, keeping them in memory (variables total
and n
). During "second" file traversal, NR>FNR
, and we print out all the original data + averages:
Score 1 24 HG 1 1.5
Score 2 26 HG 2 1.5
Score 5 56 RP 0.5 0.75
Score 7 82 RP 1 0.75
Score 12 97 GM 5 5
Score 32 104 LS 3 3
Upvotes: 2
Reputation: 246807
You can use 1 pass through the file, but you have to store in memory the entire file, so disk i/o vs memory tradeoff:
awk '
BEGIN {FS = OFS = "\t"}
{total[$4] += $5; n[$4]++; line[NR] = $0; key[NR] = $4}
END {for (i=1; i<=NR; i++) print line[i], total[key[i]] / n[key[i]]}
' file
Upvotes: 1