EarthIsHome

Reputation: 735

awk counting number of digits within a given range

How can I count how many times the number in a given field falls within a given range?

For example, the raw text foo.txt is shown below:

2,3,4,2,4
2,3,4,32,4
2,3,4,12,4
2,3,4,4,4
2,3,4,,4
2,3,4,15,4
2,3,4,15,4

I want to count the number of times the value in field #4 falls within each of the following ranges: [0,10) and [10,20), where the lower bound is inclusive and the upper bound is exclusive.

The result should be:

range 0-10: 2 range 10-20: 3

Here is my awk code, run as awk -f prog.awk foo.txt. It prints 8600001 for both ranges:

#!/usr/range/awk
# prog.awk

BEGIN {
    FS=",";
    $range1=0;
    $range2=0;
}
$4 ~ /[0-9]/ && $4 >= 0 && $4 < 10 { $range1 += 1 };
$4 ~ /[0-9]/ && $4 >= 10 && $4 < 20 { $range2 += 1 };
END {
    print $range1, "\t", $range2;
}

Upvotes: 0

Views: 790

Answers (2)

karakfa

Reputation: 67547

Another awk approach:

$ awk -F, '$4>=0{a[int($4/10)]++} 
             END{print "range 0-10:" a[0],"range 10-20:" a[1]}' file

range 0-10:2 range 10-20:3

This can easily be extended to cover every range of width 10 (note that the order in which for (k in a) visits the keys is unspecified):

$ awk -F, '$4>=0{a[int($4/10)]++} 
             END{for(k in a) print "range ["k*10"-"(k+1)*10"):", a[k]}' file

range [0-10): 2
range [10-20): 3
range [30-40): 1
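
If the buckets need to come out in ascending order on any awk, one option is to remember the largest bucket index and loop over the indices numerically. The following is only a sketch under a few assumptions: count_buckets is a hypothetical helper name, the bucket width is fixed at 10, and fields that are not made up entirely of digits (such as the empty one) are skipped.

```shell
# Hypothetical helper: count field-4 values per bucket of width 10,
# printing the non-empty buckets in ascending order.
count_buckets() {
    awk -F, '
        # Only count fields made up entirely of digits (skips the empty field).
        $4 ~ /^[0-9]+$/ {
            k = int($4 / 10)
            a[k]++
            if (k > max) max = k
        }
        END {
            # Walk the indices numerically instead of relying on for (k in a).
            for (k = 0; k <= max; k++)
                if (k in a)
                    printf "range [%d-%d): %d\n", k * 10, (k + 1) * 10, a[k]
        }
    ' "$@"
}

# Sample data copied from the question.
printf '%s\n' 2,3,4,2,4 2,3,4,32,4 2,3,4,12,4 2,3,4,4,4 2,3,4,,4 2,3,4,15,4 2,3,4,15,4 |
    count_buckets
```

The function also accepts file names, e.g. count_buckets foo.txt.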

Upvotes: 3

John1024

Reputation: 113944

$ awk -F, '0<=$4 && $4<10{a++} 10<=$4 && $4<20{b++}  END{printf "range 0-10: %i range 10-20: %i\n",a,b}' foo.txt
range 0-10: 2 range 10-20: 3

How it works

  • 0<=$4 && $4<10{a++}

    This counts every time the fourth field is in [0,10).

  • 10<=$4 && $4<20{b++}

    This counts every time the fourth field is in [10,20).

  • END{printf "range 0-10: %i range 10-20: %i\n",a,b}

    After we have finished reading the file, this prints out the results in the desired format.
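
The same condition can also be written once with the bounds passed in from the shell. This is a rough sketch rather than part of the original answer: count_in_range, lo and hi are hypothetical names, and the extra regex test is an assumption that only all-digit fields should be counted.

```shell
# Hypothetical helper: count_in_range LO HI [FILE...]
# Counts records whose fourth field lies in [LO, HI).
count_in_range() {
    lo=$1 hi=$2
    shift 2
    awk -F, -v lo="$lo" -v hi="$hi" '
        # Require an all-digit field so the empty field is skipped explicitly.
        $4 ~ /^[0-9]+$/ && $4 >= lo && $4 < hi { n++ }
        END { printf "range %d-%d: %d\n", lo, hi, n }
    ' "$@"
}

# Sample data copied from the question.
data='2,3,4,2,4
2,3,4,32,4
2,3,4,12,4
2,3,4,4,4
2,3,4,,4
2,3,4,15,4
2,3,4,15,4'

printf '%s\n' "$data" | count_in_range 0 10
printf '%s\n' "$data" | count_in_range 10 20
```

Each call reads the input once, so for many ranges a single awk program with several counters is likely the better choice.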

Multiline version

For those who prefer their code spread over multiple lines:

awk -F, '
    0<=$4 && $4<10 {
        a++
    } 

    10<=$4 && $4<20{
        b++
    }

    END{
        printf "range 0-10: %i range 10-20: %i\n", a, b
    }
    ' foo.txt

Modified version of original code

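In awk, $range1 is the value of the field whose number is range1. Because range1 is uninitialized, it evaluates to 0, so $range1 refers to $0, the entire record. This is not what you want. If you are not referencing a field number, do not use $. Thus: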

BEGIN {
    FS=",";
    range1=0;
    range2=0;
}
$4 ~ /[0-9]/ && $4 >= 0 && $4 < 10 { range1 += 1 };
$4 ~ /[0-9]/ && $4 >= 10 && $4 < 20 { range2 += 1 };
END {
    print range1, "\t", range2;
}

Note that initializing the range variables to zero is not necessary: an uninitialized variable evaluates to zero when used as a number.
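
To see what $ does with a variable, here is a small illustration. The names field_demo, n and unset are made up for this example and do not come from the question.

```shell
# Hypothetical demo: $ applied to a variable selects a field by number.
field_demo() {
    awk -F, '{
        n = 2
        print $n        # n is 2, so this prints the second field
        print $unset    # unset evaluates to 0, so this prints $0, the whole record
    }'
}

echo 'a,b,c' | field_demo
```

The first line printed is b and the second is a,b,c, which is why assigning to $range1 in the original program modified the record instead of a counter.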

Upvotes: 3
