Kay
Kay

Reputation: 2067

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows

ifile.dat
1   10
3   34
1   4
3   32
5   3
2   2
4   20
3   13
4   50
1   40
2   20
5   2

I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.

ofile.dat
1-2   40  15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2; 
               #40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4   50  29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4; 
               #50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6   3    2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6; 
               #3 is their maximum and 2.5 is their mean i.e. (3+2)/2

Similarly if I choose the range of selection with 3 number, then the desire output will be

ofile.dat
1-3 40 19.37
4-6 50 18.7

I have the following script which calculates for single values in the 1st column. But I am looking for multiple selections from 1st column.

awk '{
    if (a[$1] < $2) { a[$1]=$2 }} END { for (i in a){}} 
    {b[$1]+=$2; c[$1]++} END{for (i in b) 
    printf "%d %2s %5s  %5.2f\n", i, OFS, a[i], b[i]/c[i]}' ifile.dat

The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...

Upvotes: 1

Views: 69

Answers (2)

dash-o
dash-o

Reputation: 14442

The following awk script will provide basic descriptive statistics with grouping.

Suggesting to look into more robust solution (Python, Perl, R, ...) which will support additional measures, flexibility - no point to reinvent the circle.

Logic for grouping separated is 1-1000, 1001-2000, as per comment above. Code is verbose for clarity.

awk '
{
    # Total Counter
    nn++ ;

    # Group id
    gsize = 1000
    gid = int(($1-1)/gsize )

    v = $2
    # Setup new group, if needed
    if ( !n[gid] ) {
        n[gid] = 0
        sum[gid] = 0
        max[gid] = min[gid] = v 
        name[gid] = (gid*gsize +1) "-" ((gid+1)*gsize) 
    }
    if ( v > max[gid] ) max[gid] = v
    sum[gid] += v
    n[gid]++
}
END {
    # Print all groups
    for (gid in name) {
        printf "%-20s %4d %6.1f %5.1F\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn ;
    }
}
'

Upvotes: 1

RavinderSingh13
RavinderSingh13

Reputation: 133458

Could you please try following, tested and written with shown samples only.

sort -k1 Input_file |
awk -v range="1" '
  !b[$1]++{
    c[++count]=$1
  }
  {
    a[$1]=a[$1]>$2?a[$2]:$2
    d[$1]+=$2
    e[$1]++
    till=$1
  }
  END{
    for(i=1;i<=till;i+=(range+1)){
       for(j=i;j<=i+range;j++){
          max=max>a[c[j]]?max:a[c[j]]
          total+=d[c[j]]
          occr+=e[c[j]]
       }
       print i"-"i+range,max,occr?total/occr:0
       occr=total=max=""
    }
  }
'

For shown samples output will be as follows.

1-2 40 15.2
3-4 50 29.8
5-6 3 2.5

I have kept range variable as 1 since difference of 1st digit is 2 so in your case case lets say 1,1001 and so on is there then keep range variable value as 999 for same.

Upvotes: 1

Related Questions