xgrau

Reputation: 309

distribution of values in intervals with awk

I'd like to count the number of records in an input file (containing unsorted numeric values) that fall within each of a series of intervals between a minimum and a maximum value.

Let me explain it with an example. Given this input:

 text 12
 text 1
 xxxx 19
 ffff 0
 dddd 5
 dddd 7
 pppp 41

I'd like to count the number of lines whose second field is in the 0-10 interval, 11-20, 21-30, etc. (step = 10)

 awk '{
     if      ($2 <= 10)            first++
     else if ($2 > 10 && $2 <= 20) second++
     else if ($2 > 20 && $2 <= 30) third++
     else if ($2 > 30 && $2 <= 40) fourth++
     else if ($2 > 40 && $2 <= 50) fifth++
 } END {
     # +0 makes counters that were never incremented print as 0
     print first+0, second+0, third+0, fourth+0, fifth+0
 }' input.txt

This gives me a count like this:

 4 2 0 0 1

The problem is that I'd like to build the script so that both the end of the last interval and the number of intervals can be arbitrary, depending on the input.

That is, I'd like to use the largest value in the file (41) to define the last range. Given a step=10, the last range would be automatically assigned to 41-50. But these numbers would change depending on the input.

Is there a way to build a for loop that does what I need?

Sorry I couldn't be more precise with my code snippet, but I've never used for loops in awk for this kind of thing before.

Thanks in advance!

Upvotes: 0

Views: 682

Answers (1)

Ed Morton

Reputation: 203645

I'm confused by your question, but if I understand what you want, then this is the right approach:

$ cat tst.awk
{
    bucket = int(($2/10)+1)
    count[bucket]++
    max = ((NR==1 || bucket>max) ? bucket : max)
}
END {
    for (bucket=1;bucket<=max;bucket++) {
        printf "%d%s", count[bucket], (bucket<max?OFS:ORS)
    }
}

$ awk -f tst.awk file
4 2 0 0 1

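Change 10 to whatever step you like, or use a variable if you prefer. If you have a predefined maximum bucket number you want to use, then use a variable for that too instead of calculating max. Note that as written the buckets are 0-9, 10-19, 20-29, etc., so a value of exactly 10 lands in the second bucket rather than the first.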
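A minimal sketch of the variable-based version, in case it helps. It is an illustration under some assumptions, not part of the script above: the file name hist.awk and the variables step and nbuckets are hypothetical names, the bucket edges follow the question's convention (0-10, 11-20, 21-30, ...), and the printed range labels assume integer values in the second field.

```shell
# Sample input from the question
cat > input.txt <<'EOF'
text 12
text 1
xxxx 19
ffff 0
dddd 5
dddd 7
pppp 41
EOF

# hist.awk is a hypothetical file name; step and nbuckets are illustrative variables
cat > hist.awk <<'EOF'
BEGIN { if (step == "") step = 10 }       # default interval width
{
    # Smallest bucket b such that $2 <= b*step, so 0-10 -> 1, 11-20 -> 2, ...
    b = int($2 / step)
    if (b * step < $2) b++
    if (b < 1) b = 1                      # values <= 0 go into the first bucket
    count[b]++
    if (b > max) max = b                  # highest bucket seen in the input
}
END {
    if (nbuckets != "") max = nbuckets    # optional fixed number of buckets
    for (b = 1; b <= max; b++) {
        lo = (b == 1) ? 0 : (b - 1) * step + 1
        printf "%d-%d\t%d\n", lo, b * step, count[b]
    }
}
EOF

awk -v step=10 -f hist.awk input.txt
```

With the sample input and step=10 this prints five tab-separated rows, 0-10 through 41-50, with counts 4, 2, 0, 0 and 1. Passing something like -v nbuckets=6 would print a fixed six rows instead of stopping at the bucket that holds the largest value.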

Upvotes: 1
