user3831155
user3831155

Reputation: 81

Calculating 95th percentile with awk

I'm new in awk scripting and would like to have some help in calculating 95th percentile value for a file that consist of this data:

0.0001357
0.000112
0.000062
0.000054
0.000127
0.000114
0.000136

I tried:

cat filename.txt | sort -n |
awk 'BEGIN{c=0} {total[c]=$1; c++;} END{print total[int(NR*0.95-0.5)]}'

but I dont seem to get the correct value when I compare it to excel.

Upvotes: 8

Views: 19263

Answers (3)

rudi
rudi

Reputation: 791

Just for the record, there is also solution, inspired by merlin2011 answer, that prints several desired percentiles:

# get amount of values
num="$(wc -l input.txt | cut -f1 -d' ')";
# sort values
sort -n input.txt > temp && mv temp input.txt
# print the desired percentiles
for p in 50 70 80 90 92 95 99 100; do 
  printf "%3s%%: %-5.5sms\n" "$p" "$(head input.txt -n "$((num / 100 * $p))" | tail -n1)";
done

Update: I messed it up. Bash math can't handle floating numbers, even not if used during a "single expression". That only works for files with 100*(N>0) values. So either bc or awk is required to do the math.

In case you have an "odd" amount of values, you should replace "$((num / 100 * $p))" with "$(awk "BEGIN {print int($num/100*$p)}")" in the code above.

Finally awk is part of that answer. ;)

Upvotes: 0

Tiago Lopo
Tiago Lopo

Reputation: 7959

Following the calculation suggested here, you can do this:

sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'

Output for given input:

sort file -n | awk 'BEGIN{c=0} length($0){a[c]=$0;c++}END{p5=(c/100*5); p5=p5%1?int(p5)+1:p5; print a[c-p5-1]}'
0.0001357

Explanation:

  1. Sort the file numerically
  2. drop the top 5%
  3. pick the next value

PS. The statement p5=p5%1?int(p5)+1:p5 is doing a ceil operation available in many languages.

Upvotes: 2

merlin2011
merlin2011

Reputation: 75585

I am not sure if Excel does some kind of weighted percentile, but if you actually want one of the numbers that was in your original set, then your method should work correctly for rounding.

You can simplify a little bit like this, but it's the same thing.

sort -n input.txt  | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'

Upvotes: 13

Related Questions