Wara

Reputation: 322

AWK output problems

I wrote a script for getting the MEAN and the STDEV from a data file. Let's say the data file has this data:

1 
2 
3 
4 
5

The awk script looks like this:

awk '{MEAN+=$1/5}END{print MEAN, STDEV=sqrt(($1-MEAN)**2/4)}' dat.dat>stat1.dat

but it gives me an incorrect value of STDEV=1. It should be 1.5811. Do you know what is wrong with my script? How could I improve it?

Upvotes: 2

Views: 72

Answers (4)

karakfa

Reputation: 67497

You can do the same in one pass:

$ seq 5 | awk '{sum+=$1; sqsum+=$1^2} 
            END{mean=sum/NR; 
                print mean, sqrt((sqsum-NR*mean^2)/(NR-1))}'

3 1.58114

Note that this is the sample standard deviation (it divides by N-1, not N).
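As for why the script in the question prints 1: the END block runs once, after all input has been read, and in gawk and other POSIX-conforming awks $1 there still holds the first field of the last record only (5 here). The expression therefore evaluates sqrt((5-3)^2/4) = 1 instead of summing the squared deviations over every row. A minimal sketch that reproduces this next to the one-pass formula; the function names last_row_only and all_rows are made up for illustration and are not from the original post:

```shell
# Same logic as the question: in END, $1 is only the last record's value.
last_row_only() {
  awk '{ sum += $1 }
       END { mean = sum/NR; printf "%.5f\n", sqrt(($1 - mean)^2 / (NR - 1)) }'
}

# Accumulates the sum and the sum of squares over every row,
# then applies the N-1 (sample) formula once in END.
all_rows() {
  awk '{ sum += $1; sqsum += $1^2 }
       END { mean = sum/NR; printf "%.5f\n", sqrt((sqsum - NR*mean^2) / (NR - 1)) }'
}

seq 5 | last_row_only   # 1.00000
seq 5 | all_rows        # 1.58114
```

The point is that anything that must see every row has to be accumulated in the main block; END can only combine the accumulated totals.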

Upvotes: 1

Thor

Reputation: 47099

Here is a two-pass streamable version:

parse.awk

# First-pass:  sum the numbers
FNR == NR { sum += $1; next }

# After first pass:  determine sample size (N) and mean
# Note:  run only once because of the f flag
!f { 
  N    = NR-1    # Number of samples
  mean = sum/N   # The mean of the samples
  f    = 1
}

# Second-pass:  add the squares of the sample distance to mean
{ varsum += ($1 - mean)^2 }

END {
  # Sample standard deviation
  sstd = sqrt( varsum/(N-1) )
  print "Sample std: " sstd
}

Run it like this for a file:

awk -f parse.awk file.dat{,}

Run it like this for streams:

awk -f parse.awk <(seq 5) <(seq 5)

Output in both cases:

Sample std: 1.58114
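If the input is a pipe that cannot be opened twice, one alternative to naming the input twice is to keep the values in memory and run the second pass inside END. A rough sketch, assuming the data fits in memory; the function name buffered_sstd is hypothetical:

```shell
buffered_sstd() {
  awk '
    { x[NR] = $1; sum += $1 }            # first pass: store each value and sum them
    END {
      if (NR < 2) { print "need at least 2 samples"; exit 1 }
      mean = sum / NR
      for (i = 1; i <= NR; i++)          # second pass over the stored values
        varsum += (x[i] - mean)^2
      print "Sample std: " sqrt(varsum / (NR - 1))
    }'
}

seq 5 | buffered_sstd   # Sample std: 1.58114
```

This trades memory for convenience: the two-file form above keeps only running totals, while this variant holds every value until the end.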

Upvotes: 0

Vinicius Placco

Reputation: 1731

Even though the title and tag say awk, I wanted to add that calculating the mean and stdev for a column of data can be easily accomplished with datamash:

seq 1 5 | datamash mean 1 sstdev 1
3   1.5811388300842

It may be off-topic here (and I realize that programming simple tasks like this in awk can be a good learning opportunity), but I think datamash deserves some attention, especially for straightforward calculations such as this one. The documentation lists all the functions it can perform, with good examples for files with many columns. It is a fast and reliable alternative. Hope it helps!

Upvotes: 1

RavinderSingh13

Reputation: 133518

Could you please try the following and let me know if it helps? It should work on the provided data, and also if your actual file has more fields.

awk '{sum=0;for(i=1;i<=NF;i++){sum+=$i};mean=sum?sum/NF:0;sum="";for(j=1;j<=NF;j++){$j=($j-mean)*($j-mean);sum+=$j};print "Mean=",mean", S.D=",sqrt(sum/NF)}'  Input_file

Here is the same solution in multi-line form:

awk '
{
  sum=0;                                   # start each row from zero
  for(i=1;i<=NF;i++){  sum+=$i  };         # sum of the fields in this row
  mean=sum?sum/NF:0;
  sum="";
  for(j=1;j<=NF;j++){  $j=($j-mean)*($j-mean);
                       sum+=$j};           # sum of squared deviations
  print "Mean=",mean", S.D=",sqrt(sum/NF)
}
'  Input_file

EDIT: Adding code similar to the above, with one extra guard: if the sum of squared deviations is zero, it prints 0 for the S.D.

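awk '
{
  sum=0;                                   # start each row from zero
  for(i=1;i<=NF;i++){  sum+=$i  };         # sum of the fields in this row
  mean=sum?sum/NF:0
  sum="";
  for(j=1;j<=NF;j++){  $j=($j-mean)*($j-mean);
                       sum+=$j};           # sum of squared deviations
  val=sum?sqrt(sum/NF):0                   # guard: print 0 when the sum is zero
  print "Mean=",mean", S.D=",val
}
'  Input_file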
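Note that the snippets above work across the fields of each row and divide by NF, which is the population standard deviation. For a row 1 2 3 4 5 that gives about 1.41421, not the 1.5811 asked for in the question. A hedged sketch of a per-row variant that divides by NF-1 instead and resets its accumulators for every row; row_sstd is a hypothetical name, not part of the original answer:

```shell
row_sstd() {
  awk '{
    sum = 0; sq = 0                          # reset the accumulators for each row
    for (i = 1; i <= NF; i++) sum += $i
    mean = NF ? sum / NF : 0                 # avoid dividing by zero on empty rows
    for (i = 1; i <= NF; i++) sq += ($i - mean)^2
    sd = (NF > 1) ? sqrt(sq / (NF - 1)) : 0  # sample S.D. needs at least 2 values
    print "Mean=", mean ", S.D=", sd
  }'
}

echo "1 2 3 4 5" | row_sstd   # Mean= 3, S.D= 1.58114
```

If the data is laid out one value per line, as shown in the question, a column-wise approach like the ones in the other answers is the better fit.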

Upvotes: 1
