Reputation: 87
I have a data look like this: The highest and lowest midterm grade of each class in school A and school B (class not shown)
#school highest lowest
schoolA 99 53
schoolA 95 66
schoolA 88 48
schoolB 94 55
schoolB 91 36
I want to merge it like this:
schoolA 99 48
schoolB 94 36
the max and the min of each school are shown. I have tried something like:
awk '
BEGIN{getline;min=$3;max=$2}
{($3<min)?min=$3:"";($2>max)?max=$2:""}
END{OFS="\t";print $1,max,min}
'
and it worked; however, it sometimes automatically plus one point at the min (the third column)
Could anyone teach me how to do this correctly and explain a little bit what does the code above mean? (especially the "getline") Can this code merge lines by column one (school)?
Upvotes: 0
Views: 790
Reputation: 189357
getline
fetches the next input line. Doing this from the BEGIN
block is kind of inelegant, though, as Awk will read every line anyway. I would do this instead:
NR==1{min=$3;max=$2;next}
to ensure that max
and min
are initialized to the values from the first line when the script starts.
The ($3<min?min=$3:"")
and the corresponding statement for $2
and max
is pretty obscure. The construct x ? y : z
is called the ternary operator and is a shorthand for if (x) y; else z
so this evaluates to
if($3<min)
min=$3
else
""
where ""
in this context is basically like a comment, i.e. a way to say explicitly "do nothing".
Your current script fetches the max and min overall. You apparently want to get the max and min separately for each key instead. Assuming all values for each school are adjacent,
NR>1 && $1!=prev { print prev, max, min }
NR==1 || $1!=prev { prev=$1; max=$2; min=$3 }
$2>max { max=$2 }
$3<min { min=$3 }
END { print prev, max, min }
If the input is sorted, at least enough that all mentions of one keyeare on adjacent lines, Awk can process a much larger input file, because it doesn't need to keep all keys in memory. Your sample file seems to have this property, so if it's representative, this should work. If not, maybe use sort file | awk
(after discarding any header line, if present).
If the input file has a header, but it's already sorted, add NR==1 { next }
at the top to skip it, and adjust the other NR
mentions to compare to 2 instead of 1.
Upvotes: 0
Reputation: 103774
With POSIX awk, you can do:
awk ' BEGIN{fmt="%-15s%-10s%-10s\n"; printf fmt,"School","max","min"}
!($1 in sch) {idx[++i]=$1; sch[$1]; arr[$1,"min"]=100}
$2>arr[$1,"max"]{arr[$1,"max"]=$2}
$3<arr[$1,"min"]{arr[$1,"min"]=$3}
END{for (e=1;e<=length(idx);e++) printf fmt,idx[e],arr[idx[e],"max"],arr[idx[e],"min"]}' file
School max min
schoolA 99 48
schoolB 94 36
As written, this maintains file order for the schools printed. It is simpler to write if you do not care about the output order.
It is unclear if your file has a header or not.
If it DOES have a header, replace printf fmt,"School","max","min"
with FNR==1{printf fmt,$1,$2,$3}
to print the header. (or FNR==1{next}
if you want to skip the header..)
Upvotes: 0
Reputation: 23667
If datamash is okay:
$ datamash -W -g1 max 2 min 3 < ip.txt
schoolA 99 48
schoolB 94 36
-W
to use whitespace as delimiter-g1
group by 1st fieldmax 2 min 3
maximum of 2nd field and minimum of 3rd field--header-in
option to ignore itUpvotes: 4
Reputation: 195039
first of all, I don't think your (condition)?var=one:two
is correct. E.g:
awk 'BEGIN{(3>5)?a=1:2;print a}'
outputs nothing. It should be written in:
$ awk 'BEGIN{a=(3>5)?1:2;print a}'
2
getline
.school
therefor your codes will find min/max from all schools values. It is not what you want.For your problem, you can write like this:
awk -v OFS='\t' '$1 in min{min[$1]=$3<min[$1]?$3:min[$1]
max[$1]=$2>max[$1]?$2:max[$1]
next } {min[$1]=$3;max[$1]=$2}
END{for(x in min)print x, max[x], min[x]}' file
Upvotes: 0