Reputation: 11

awk variance on part of records if lines of specific field match

I'm very much new to using awk, but I found lots of help in Fredrik Pihl's answer to this question on how to calculate the mean of a field ($3) over many records that share another field ($1):

Question: awk average part of column if lines (specific field) match

Input sample:

$cat NDVI-bm  
P01 031.RAW 0.516 0 0  
P01 021.RAW 0.449 0 0  
P02 045.RAW 0.418 0 0  
P03 062.RAW 0.570 0 0  
P03 064.RAW 0.469 0 0  
P04 083.RAW 0.636 0 0  
P04 081.RAW 0.592 0 0  
P04 082.RAW 0.605 0 0  
P04 084.RAW 0.648 0 0  
P05 093.RAW 0.748 0 0

Fredrik Pihl's answer:

{
    sum[$1]+=$3
    cnt[$1]++
}


END {
    print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
    for (i in sum)
        print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i]

}

However, I got stuck also trying to calculate the variance over the matching fields (the sum of squares of differences between every value and the mean, divided by the count). I think I may need a way to calculate the mean already before the END structure for each matching records, or if it is possible to do the entire variance calculation within the END structure but then I somehow need to retrieve the original values of $3. I'm not sure how to do either. Thanks for any hints.

Upvotes: 1

Answers (3)

cat.b

Reputation: 1

I think that in your awk script, groups[$3]=$1 will not work correctly if several raws have same value in column 3, because these N same values be counted only once in groups.

Upvotes: 0

captcha

Reputation: 3756

Code for GNU awk:

{
    sum[$1]+=$3
    count[$1]++
    groups[$3]=$1
}

END {
    for (i in sum) mean[i]=sum[i]/count[i]
    for (i in groups) meandiff[i]=i-mean[groups[i]]
    for (i in groups) sumdiff2[groups[i]]+=meandiff[i]^2
    for (i in sumdiff2) var[i]=sumdiff2[i]/count[i]
    for (i in var)
        print "group:", i, "count:", count[i], "\tmean:", mean[i], "\tsum:", sum[i], "\tsumdiff^2:", sumdiff2[i], "\t\tvariance:", var[i]
}

$cat file
P01 031.RAW 0.516 0 0
P01 021.RAW 0.449 0 0
P02 045.RAW 0.418 0 0
P03 062.RAW 0.570 0 0
P03 064.RAW 0.469 0 0
P04 083.RAW 0.636 0 0
P04 081.RAW 0.592 0 0
P04 082.RAW 0.605 0 0
P04 084.RAW 0.648 0 0
P05 093.RAW 0.748 0 0

$awk -f prog.awk file
group: P01 count: 2     mean: 0.4825    sum: 0.965      sumdiff^2: 0.0022445    variance: 0.00112225
group: P02 count: 1     mean: 0.418     sum: 0.418      sumdiff^2: 0            variance: 0
group: P03 count: 2     mean: 0.5195    sum: 1.039      sumdiff^2: 0.0051005    variance: 0.00255025
group: P04 count: 4     mean: 0.62025   sum: 2.481      sumdiff^2: 0.00204875   variance: 0.000512188
group: P05 count: 1     mean: 0.748     sum: 0.748      sumdiff^2: 0            variance: 0

Upvotes: 2

parkydr

Reputation: 7792

You can calculate the variance at the end by calculating the sum and sum of squares of the samples.

Then

variance = (Sum of squares - (Sum*Sum)/n)/n

{
    sum[$1]+=$3
    sum_squares[$1]+=$3*$3
    cnt[$1]++
}


END {
    print "Name" "\t" "sum" "\t" "cnt" "\t" "avg" "\t" "var"
    for (i in sum)
        print i "\t" sum[i] "\t" cnt[i] "\t" sum[i]/cnt[i] "\t" (sum_squares[i] - (sum[i]*sum[i])/cnt[i])/cnt[i]

}

To select a particular pattern, add it to the start of the summing calculations (note END matches the end of the file) e.g.

/P03/ {
    sum[$1]+=$3
    sum_squares[$1]+=$3*$3
    cnt[$1]++
}

Now only lines with P03 in them will be processed

Upvotes: 0

awk variance on part of records if lines of specific field match

Answers (3)

Related Questions