Reputation: 322
I wrote a script for getting the MEAN and the STDEV from a data file. Let's say the data file has this data:
1
2
3
4
5
The awk script looks like this:
awk '{MEAN+=$1/5}END{print MEAN, STDEV=sqrt(($1-MEAN)**2/4)}' dat.dat>stat1.dat
but it gives me an incorrect value of STDEV=1. It should be 1.5811. Do you know what is wrong with my script? How could I improve it?
Upvotes: 2
Views: 72
Reputation: 67497
You can do the same in one pass:
$ seq 5 | awk '{sum+=$1; sqsum+=$1^2}
END{mean=sum/NR;
print mean, sqrt((sqsum-NR*mean^2)/(NR-1))}'
3 1.58114
Note that this is the sample standard deviation (divide by N-1), not the population standard deviation (divide by N).
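As a sketch of why the one-pass expression works, write $N$ for NR, $x_i$ for the values in column 1, and $\bar{x}$ for mean (this notation is introduced here for illustration and is not part of the answer's code). Expanding the squared deviations and using $\sum_i x_i = N\bar{x}$ gives the form used inside sqrt:

```latex
\begin{aligned}
\sum_{i=1}^{N} (x_i - \bar{x})^2
  &= \sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2
   = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2, \\
s &= \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - N\bar{x}^2}{N-1}}
\end{aligned}
```

For the sample data 1 to 5: $\sum x_i^2 = 55$, $\bar{x} = 3$, so $s = \sqrt{(55 - 45)/4} = \sqrt{2.5} \approx 1.58114$. Because this form subtracts two potentially large numbers, it can lose precision on data with a large mean and a small spread; the two-pass approach is usually safer there.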
Upvotes: 1
Reputation: 47099
Here is a two-pass streamable version:
parse.awk
# First-pass: sum the numbers
FNR == NR { sum += $1; next }
# After first pass: determine sample size (N) and mean
# Note: run only once because of the f flag
!f {
    N = NR-1      # Number of samples
    mean = sum/N  # The mean of the samples
    f = 1
}
# Second-pass: add the squares of the sample distance to mean
# (^ is the POSIX exponent operator; ** is not supported by every awk)
{ varsum += ($1 - mean)^2 }
END {
    # Sample standard deviation
    sstd = sqrt( varsum/(N-1) )
    print "Sample std: " sstd
}
Run it like this for a file (in bash, file.dat{,} expands to file.dat file.dat, so the file is read twice):
awk -f parse.awk file.dat{,}
Run it like this for streams:
awk -f parse.awk <(seq 5) <(seq 5)
Output in both cases:
Sample std: 1.58114
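If reading the input twice is inconvenient, the two passes can also be done inside one awk invocation by buffering the values. The following is a minimal sketch, assuming the data fits in memory; `mean_sstd` is a hypothetical helper name, not something from the answer above. It also shows what goes wrong in the question's script: in POSIX-conforming awks, `$1` in the END block still holds only the last record read (5 in the sample data), so `sqrt((5-3)^2/4)` evaluates to 1.

```shell
# Hypothetical helper: buffer column 1, then compute mean and sample SD in END.
mean_sstd() {
  awk '
    { x[NR] = $1; sum += $1 }   # first pass: keep every value and the running sum
    END {
      if (NR < 2) { print "need at least 2 values" > "/dev/stderr"; exit 1 }
      mean = sum / NR
      # second pass over the buffered values: squared distances to the mean
      for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
      printf "%g %.5f\n", mean, sqrt(ss / (NR - 1))
    }' "$@"
}

printf '%s\n' 1 2 3 4 5 | mean_sstd
```

The function reads standard input when called without arguments, or a file when called as `mean_sstd dat.dat`.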
Upvotes: 0
Reputation: 1731
Even though the title and tag say awk, I wanted to add that calculating the mean and stdev for a column of data can be easily accomplished with datamash:
seq 1 5 | datamash mean 1 sstdev 1
3 1.5811388300842
It may be off-topic here (and I realize that programming simple tasks like that in awk can be a good learning opportunity), but I think datamash deserves some attention, especially for straightforward calculations such as this one. The documentation lists all the functions it can perform and gives good examples for files with many columns. It is a fast and reliable alternative. Hope it helps!
Upvotes: 1
Reputation: 133518
Could you please try the following and let me know if it helps? Note that it computes the mean and the population standard deviation (dividing by NF) across the fields of each line, so it also handles lines with several fields.
awk '{sum=0;for(i=1;i<=NF;i++){sum+=$i};mean=sum?sum/NF:0;sum=0;for(j=1;j<=NF;j++){sum+=($j-mean)*($j-mean)};print "Mean=",mean", S.D=",sqrt(sum/NF)}' Input_file
Here is the same solution in multi-line form:
awk '
{
  sum=0;                                  # reset for every line
  for(i=1;i<=NF;i++){ sum+=$i };          # sum of the fields
  mean=sum?sum/NF:0;
  sum=0;                                  # reuse sum for squared deviations
  for(j=1;j<=NF;j++){ sum+=($j-mean)*($j-mean) };
  print "Mean=",mean", S.D=",sqrt(sum/NF)
}
' Input_file
EDIT: Here is a similar version with a simple guard: if the sum of squared deviations is zero, it prints 0 for the S.D. instead of evaluating sqrt(sum/NF), which also avoids dividing by zero on an empty line (where NF is 0).
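awk '
{
  sum=0;                                  # reset for every line
  for(i=1;i<=NF;i++){ sum+=$i };          # sum of the fields
  mean=sum?sum/NF:0;
  sum=0;                                  # reuse sum for squared deviations
  for(j=1;j<=NF;j++){ sum+=($j-mean)*($j-mean) };
  val=sum?sqrt(sum/NF):0;                 # guard: print 0 when sum is zero
  print "Mean=",mean", S.D=",val
}
' Input_file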
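The scripts above work across the fields of each line. If the goal is instead one mean and one sample standard deviation per column, as in the question, something along these lines might work. It is only a sketch: `col_stats` is a hypothetical helper name, every line is assumed to have the same number of numeric fields, and it uses the one-pass sum-of-squares form, which can lose precision when the mean is large compared with the spread.

```shell
# Hypothetical helper: per-column mean and sample SD (divide by N-1).
col_stats() {
  awk '
    {
      # accumulate the sum and the sum of squares of every column
      for (i = 1; i <= NF; i++) { sum[i] += $i; sq[i] += $i * $i }
      if (NF > ncol) ncol = NF
    }
    END {
      if (NR < 2) exit 1            # sample SD needs at least two rows
      for (i = 1; i <= ncol; i++) {
        mean = sum[i] / NR
        var = (sq[i] - NR * mean ^ 2) / (NR - 1)
        if (var < 0) var = 0        # clamp tiny negative values from rounding
        printf "col %d: mean=%g sstd=%.5f\n", i, mean, sqrt(var)
      }
    }' "$@"
}

printf '%s\n' '1 10' '2 20' '3 30' '4 40' '5 50' | col_stats
```

With a single-column file such as the one in the question, this prints one line for column 1 only.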
Upvotes: 1