Reputation: 2067
I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly if I choose the range of selection with 3 number, then the desire output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script which calculates for single values in the 1st column. But I am looking for multiple selections from 1st column.
awk '{
if (a[$1] < $2) { a[$1]=$2 }} END { for (i in a){}}
{b[$1]+=$2; c[$1]++} END{for (i in b)
printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]}' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
Upvotes: 1
Views: 69
Reputation: 14442
The following awk script will provide basic descriptive statistics with grouping.
Suggesting to look into more robust solution (Python, Perl, R, ...) which will support additional measures, flexibility - no point to reinvent the circle.
Logic for grouping separated is 1-1000, 1001-2000, as per comment above. Code is verbose for clarity.
awk '
{
# Total Counter
nn++ ;
# Group id
gsize = 1000
gid = int(($1-1)/gsize )
v = $2
# Setup new group, if needed
if ( !n[gid] ) {
n[gid] = 0
sum[gid] = 0
max[gid] = min[gid] = v
name[gid] = (gid*gsize +1) "-" ((gid+1)*gsize)
}
if ( v > max[gid] ) max[gid] = v
sum[gid] += v
n[gid]++
}
END {
# Print all groups
for (gid in name) {
printf "%-20s %4d %6.1f %5.1F\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn ;
}
}
'
Upvotes: 1
Reputation: 133458
Could you please try following, tested and written with shown samples only.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
a[$1]=a[$1]>$2?a[$2]:$2
d[$1]+=$2
e[$1]++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
max=max>a[c[j]]?max:a[c[j]]
total+=d[c[j]]
occr+=e[c[j]]
}
print i"-"i+range,max,occr?total/occr:0
occr=total=max=""
}
}
'
For shown samples output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept range
variable as 1
since difference of 1st digit is 2
so in your case case lets say 1,1001
and so on is there then keep range variable value as 999
for same.
Upvotes: 1