user1950349

Reputation: 5146

How to get average, median, mean stats from a file which has numbers in first column?

I have a file in which I have numbers in seconds like below:

0.01033
0.003797
0.02648
0.007583
0.007491
0.028038
0.012794
0.00524
0.019655
0.019643
0.012969
0.011087
0.044564

What is the best way to get the "average", "mean", "median", "95th percentile" and "99th percentile" from this file? The file is on my Linux box, so I need a Linux command that can produce those stats.

Upvotes: 5

Views: 3848

Answers (2)

Allan

Reputation: 12448

As explained in the other answer, datamash is a very powerful tool. If you want a pure awk solution:

Average: (variables are auto-initialized to zero by awk)

awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'

or as a standalone script with a shebang line:

#!/usr/bin/awk -f

# Sum the first column; NR is the number of lines read.
{ sum += $1 }
END { if (NR > 0) print sum / NR }

Median:

#!/usr/bin/awk -f
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        print count[(NR + 1) / 2];
    } else {
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
} 

The script assumes its input is already sorted numerically, so save it as median.awk and sort the file first:

sort -n data_file | awk -f median.awk
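The mean and median steps above can also be combined into a single pass over the sorted data. The following is a minimal sketch, not part of the original answer: `summary` is a hypothetical function name, it reads the numbers from standard input, and it skips empty lines.

```shell
# Hypothetical helper: prints the mean and median of the first column on stdin.
summary() {
    sort -n | awk '
        NF { a[++n] = $1; sum += $1 }      # store sorted values, skip empty lines
        END {
            if (n == 0) exit 1             # no data: print nothing
            if (n % 2)
                median = a[(n + 1) / 2]    # odd count: middle value
            else
                median = (a[n / 2] + a[n / 2 + 1]) / 2   # even count: mean of the two middle values
            printf "mean=%g median=%g\n", sum / n, median
        }'
}
```

It would be called as `summary < data_file`. Note that `%g` rounds the output to six significant digits; change the format string if more precision is needed.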

95th Percentile:

sort -n file | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
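The question also asks for the 99th percentile, which the one-liner above hard-codes away. A parameterized version might look like the sketch below. It is an assumption on my part rather than the same method: it uses the nearest-rank definition (the value at rank ceil(p/100 * n) of the sorted data), so on small inputs it can return a different element than the one-liner above or than datamash, which interpolates. `percentile` is a hypothetical function name.

```shell
# Hypothetical helper: nearest-rank percentile of the first column on stdin.
# Usage: percentile P < file   (P is a percentage, e.g. 95 or 99)
percentile() {
    sort -n | awk -v p="$1" '
        NF { a[++n] = $1 }                 # store sorted values, skip empty lines
        END {
            if (n == 0) exit 1             # no data: print nothing
            rank = n * p / 100
            if (rank > int(rank)) rank = int(rank) + 1   # round up (ceiling)
            if (rank < 1) rank = 1         # guard for p = 0
            print a[rank]
        }'
}
```

With that in place, `percentile 95 < file` and `percentile 99 < file` cover both requested values.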

Last but not least, you can use Miller: https://github.com/johnkerl/miller/tree/v4.5.0

Upvotes: 1

randomir

Reputation: 18697

In case you're not bound to any specific tool, try GNU datamash - a nice tool for "command-line statistical operations" on textual files.

To get the mean, median, 95th percentile and 99th percentile of the first column/field (note that fields are TAB-separated by default):

$ datamash --header-out mean 1 median 1 perc:95 1 perc:99 1  < file
mean(field-1)   median(field-1) perc:95(field-1)    perc:99(field-1)
0.016128538461538   0.012794    0.0346484   0.04258088

Upvotes: 10
