Reputation: 11
I'm very much new to using awk, but I found lots of help in Fredrik Pihl's answer to this question on how to calculate the mean of a field ($3) over many records that share another field ($1):
Question: awk average part of column if lines (specific field) match
Input sample:
$cat NDVI-bm
P01 031.RAW 0.516 0 0
P01 021.RAW 0.449 0 0
P02 045.RAW 0.418 0 0
P03 062.RAW 0.570 0 0
P03 064.RAW 0.469 0 0
P04 083.RAW 0.636 0 0
P04 081.RAW 0.592 0 0
P04 082.RAW 0.605 0 0
P04 084.RAW 0.648 0 0
P05 093.RAW 0.748 0 0
Fredrik Pihl's answer:
{
sum[$1]+=$3
cnt[$1]++
}
END {
print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
for (i in sum)
print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i]
}
However, I got stuck also trying to calculate the variance over the matching fields (the sum of squares of differences between every value and the mean, divided by the count). I think I may need a way to calculate the mean already before the END structure for each matching records, or if it is possible to do the entire variance calculation within the END structure but then I somehow need to retrieve the original values of $3. I'm not sure how to do either. Thanks for any hints.
Upvotes: 1
Views: 262
Reputation: 1
I think that in your awk script, groups[$3]=$1 will not work correctly if several raws have same value in column 3, because these N same values be counted only once in groups.
Upvotes: 0
Reputation: 3756
Code for GNU awk:
{
sum[$1]+=$3
count[$1]++
groups[$3]=$1
}
END {
for (i in sum) mean[i]=sum[i]/count[i]
for (i in groups) meandiff[i]=i-mean[groups[i]]
for (i in groups) sumdiff2[groups[i]]+=meandiff[i]^2
for (i in sumdiff2) var[i]=sumdiff2[i]/count[i]
for (i in var)
print "group:", i, "count:", count[i], "\tmean:", mean[i], "\tsum:", sum[i], "\tsumdiff^2:", sumdiff2[i], "\t\tvariance:", var[i]
}
$cat file P01 031.RAW 0.516 0 0 P01 021.RAW 0.449 0 0 P02 045.RAW 0.418 0 0 P03 062.RAW 0.570 0 0 P03 064.RAW 0.469 0 0 P04 083.RAW 0.636 0 0 P04 081.RAW 0.592 0 0 P04 082.RAW 0.605 0 0 P04 084.RAW 0.648 0 0 P05 093.RAW 0.748 0 0 $awk -f prog.awk file group: P01 count: 2 mean: 0.4825 sum: 0.965 sumdiff^2: 0.0022445 variance: 0.00112225 group: P02 count: 1 mean: 0.418 sum: 0.418 sumdiff^2: 0 variance: 0 group: P03 count: 2 mean: 0.5195 sum: 1.039 sumdiff^2: 0.0051005 variance: 0.00255025 group: P04 count: 4 mean: 0.62025 sum: 2.481 sumdiff^2: 0.00204875 variance: 0.000512188 group: P05 count: 1 mean: 0.748 sum: 0.748 sumdiff^2: 0 variance: 0
Upvotes: 2
Reputation: 7792
You can calculate the variance at the end by calculating the sum and sum of squares of the samples.
Then
variance = (Sum of squares - (Sum*Sum)/n)/n
so
{
sum[$1]+=$3
sum_squares[$1]+=$3*$3
cnt[$1]++
}
END {
print "Name" "\t" "sum" "\t" "cnt" "\t" "avg" "\t" "var"
for (i in sum)
print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i] "\t" (sum_squares[i] - (sum[i]*sum[i])/cnt[i])/cnt[i]
}
To select a particular pattern, add it to the start of the summing calculations (note END matches the end of the file) e.g.
/P03/ {
sum[$1]+=$3
sum_squares[$1]+=$3*$3
cnt[$1]++
}
Now only lines with P03 in them will be processed
Upvotes: 0