Reputation: 5146
I have a file containing numbers (in seconds), one per line, like this:
0.01033
0.003797
0.02648
0.007583
0.007491
0.028038
0.012794
0.00524
0.019655
0.019643
0.012969
0.011087
0.044564
What is the best way to get the mean (average), median, 95th percentile and 99th percentile from this file? The file is on my Linux box, so I need a Linux command that can compute these stats.
Upvotes: 5
Views: 3848
Reputation: 12448
As explained in the other answer, datamash is a very powerful tool!
If you want a pure awk solution:
Average (awk auto-initializes variables to zero):
awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'
or as a standalone script with a shebang line (note the -f, and $1 because the file has a single column):
#!/bin/awk -f
{ sum += $1 }
END { if (NR > 0) print sum / NR }
Median:
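#!/usr/bin/awk -f
# median.awk: expects numerically sorted input, one value per line
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        # odd number of values: take the middle one
        print count[(NR + 1) / 2];
    } else {
        # even number of values: average the two middle ones
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}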
You need to sort the file before using it:
sort -n data_file | awk -f median.awk
95th Percentile:
sort -n file | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
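The one-liner above hard-codes the 95th percentile, and the question also asks for the 99th. Here is a minimal sketch of a reusable version, assuming the "nearest-rank" definition (the value at rank ceil(n * p / 100) in the sorted data). The function name percentile is made up for illustration, and on small inputs its result can differ from the one-liner above because the rounding convention is not the same.

```shell
# Hypothetical helper: nearest-rank percentile of a one-column file.
#   $1 = percentile (e.g. 95 or 99)
#   $2 = input file, one number per line
percentile() {
    sort -n "$2" | awk -v p="$1" '
        length($0) { a[++n] = $1 }                 # keep sorted values, 1-based
        END {
            if (n == 0) exit 1                     # no data: print nothing
            r = n * p / 100                        # fractional rank
            k = (r == int(r)) ? r : int(r) + 1     # round up to a whole rank
            if (k < 1) k = 1                       # guard for p = 0
            print a[k]
        }'
}
```

Call it as percentile 95 file or percentile 99 file. With only 13 values, as in the question, both ranks round up to 13, so both calls return the largest value, 0.044564.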
Last but not least, you can use Miller: https://github.com/johnkerl/miller/tree/v4.5.0
Upvotes: 1
Reputation: 18697
In case you're not bound to any specific tool, try GNU datamash, a nice tool for "command-line statistical operations" on textual files.
To get the mean, median, 95th percentile and 99th percentile of the first column/field (note that fields are TAB-separated by default):
$ datamash --header-out mean 1 median 1 perc:95 1 perc:99 1 < file
mean(field-1) median(field-1) perc:95(field-1) perc:99(field-1)
0.016128538461538 0.012794 0.0346484 0.04258088
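The percentile figures in this output are consistent with linear interpolation between the two closest ranks in the sorted data; that is an inference from the numbers shown here, not a statement about datamash internals. If datamash is not installed, the same calculation can be sketched in awk. The function name interp_percentile is made up for illustration.

```shell
# Hypothetical helper: percentile with linear interpolation between ranks.
#   $1 = percentile (0-100)
#   $2 = input file, one number per line
interp_percentile() {
    sort -n "$2" | awk -v p="$1" '
        length($0) { a[n++] = $1 }                 # keep sorted values, 0-based
        END {
            if (n == 0) exit 1                     # no data: print nothing
            h  = (n - 1) * p / 100                 # fractional 0-based position
            lo = int(h)                            # index just below h
            hi = (lo + 1 < n) ? lo + 1 : lo        # index just above, clamped
            printf "%.10g\n", a[lo] + (h - lo) * (a[hi] - a[lo])
        }'
}
```

On the 13 sample values from the question, interp_percentile 95 file interpolates 40% of the way between the two largest values (0.028038 and 0.044564), which gives the 0.0346484 shown above.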
Upvotes: 10