mike
mike

Reputation: 87

Awk find max and min per key

I have a data look like this: The highest and lowest midterm grade of each class in school A and school B (class not shown)

#school   highest  lowest
schoolA   99       53
schoolA   95       66
schoolA   88       48
schoolB   94       55
schoolB   91       36

I want to merge it like this:

schoolA   99       48
schoolB   94       36

the max and the min of each school are shown. I have tried something like:

awk '
BEGIN{getline;min=$3;max=$2} 
{($3<min)?min=$3:"";($2>max)?max=$2:""} 
END{OFS="\t";print $1,max,min}
'

and it worked; however, it sometimes automatically plus one point at the min (the third column)

Could anyone teach me how to do this correctly and explain a little bit what does the code above mean? (especially the "getline") Can this code merge lines by column one (school)?

Upvotes: 0

Views: 790

Answers (4)

tripleee
tripleee

Reputation: 189357

getline fetches the next input line. Doing this from the BEGIN block is kind of inelegant, though, as Awk will read every line anyway. I would do this instead:

NR==1{min=$3;max=$2;next} 

to ensure that max and min are initialized to the values from the first line when the script starts.

The ($3<min?min=$3:"") and the corresponding statement for $2 and max is pretty obscure. The construct x ? y : z is called the ternary operator and is a shorthand for if (x) y; else z so this evaluates to

if($3<min)
  min=$3
else
  ""

where "" in this context is basically like a comment, i.e. a way to say explicitly "do nothing".

Your current script fetches the max and min overall. You apparently want to get the max and min separately for each key instead. Assuming all values for each school are adjacent,

NR>1 && $1!=prev { print prev, max, min }
NR==1 || $1!=prev { prev=$1; max=$2; min=$3 }
$2>max { max=$2 }
$3<min { min=$3 }
END { print prev, max, min }

If the input is sorted, at least enough that all mentions of one keyeare on adjacent lines, Awk can process a much larger input file, because it doesn't need to keep all keys in memory. Your sample file seems to have this property, so if it's representative, this should work. If not, maybe use sort file | awk (after discarding any header line, if present).

If the input file has a header, but it's already sorted, add NR==1 { next } at the top to skip it, and adjust the other NR mentions to compare to 2 instead of 1.

Upvotes: 0

dawg
dawg

Reputation: 103774

With POSIX awk, you can do:

awk ' BEGIN{fmt="%-15s%-10s%-10s\n"; printf fmt,"School","max","min"}
      !($1 in sch) {idx[++i]=$1; sch[$1]; arr[$1,"min"]=100}
      $2>arr[$1,"max"]{arr[$1,"max"]=$2}
      $3<arr[$1,"min"]{arr[$1,"min"]=$3}
      END{for (e=1;e<=length(idx);e++) printf fmt,idx[e],arr[idx[e],"max"],arr[idx[e],"min"]}' file
School         max       min       
schoolA        99        48        
schoolB        94        36      

As written, this maintains file order for the schools printed. It is simpler to write if you do not care about the output order.

It is unclear if your file has a header or not.

If it DOES have a header, replace printf fmt,"School","max","min" with FNR==1{printf fmt,$1,$2,$3} to print the header. (or FNR==1{next} if you want to skip the header..)

Upvotes: 0

Sundeep
Sundeep

Reputation: 23667

If datamash is okay:

$ datamash -W -g1 max 2 min 3 < ip.txt 
schoolA 99  48
schoolB 94  36
  • -W to use whitespace as delimiter
  • -g1 group by 1st field
  • max 2 min 3 maximum of 2nd field and minimum of 3rd field
  • if input file includes header line, use --header-in option to ignore it

Upvotes: 4

Kent
Kent

Reputation: 195039

first of all, I don't think your (condition)?var=one:two is correct. E.g:

awk 'BEGIN{(3>5)?a=1:2;print a}'

outputs nothing. It should be written in:

$ awk 'BEGIN{a=(3>5)?1:2;print a}' 
2
  • You don't need the getline.
  • You didn't check the school therefor your codes will find min/max from all schools values. It is not what you want.

For your problem, you can write like this:

awk -v OFS='\t' '$1 in min{min[$1]=$3<min[$1]?$3:min[$1]
               max[$1]=$2>max[$1]?$2:max[$1]
               next } {min[$1]=$3;max[$1]=$2}
            END{for(x in min)print x, max[x], min[x]}' file

Upvotes: 0

Related Questions