Reputation: 107
If I have a set of data that has repeating name values but with different variations per repeating value, how can I sort by the top of each of those repeating values? Hopefully that made sense, but I hope to demonstrate what I mean further below.
Take for example this set of data in a tab separated csv file
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
If you look in the Make column, you can see names like Acura and Honda repeat, with differences in the Model and Total column. Assume that there's 200 or so rows of this in the csv file. How can I sort the file so that the items are grouped by Make with only three of the highest in value under the Total column being displayed by each Make?
Expected output below
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 106
4 83 2014 Honda Accord 112
5 31 2007 Honda Fit 102
6 13 2016 Honda Civic 1.5t 92
...
Here is my awk code so far, I can't get past this part to even attempt grouping the makes by total column
BEGIN {
FS = OFS = "\t";
}
FNR == 1 {
print;
next;
}
FNR > 1 {
a[NR] = $4;
}
END {
PROCINFO["sorted_in"] = "@val_str_desc"
for(i = 1; i < FN-1; i++) {
print a[i];
}
}
Currently, my code reads the text file, prints the headers (column titles) and then stops there, it doesn't go on to print out the rest of the data in alphabetical order. Any ideas?
Upvotes: 0
Views: 930
Reputation: 203334
Using GNU awk for arrays of arrays and sorted_in
:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 {
print
next
}
{
rows[$4][$6][++numRows[$4,$6]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for ( make in rows ) {
PROCINFO["sorted_in"] = "@ind_num_desc"
cnt = 0
for ( total in rows[make] ) {
for ( rowNr=1; rowNr<=numRows[make,total]; rowNr++ ) {
if ( ++cnt <= 3 ) {
row = rows[make][total][rowNr]
print row, cnt
}
}
}
}
}
$ awk -f tst.awk file
Ranking ID Year Make Model Total
4 113 2012 Acura Tsx sportwagon 116 1
6 112 2008 Acura TL 110 2
9 50 2015 Acura TLX 102 3
3 15 014 Audi S4 120 1
11 216 2007 Chrystler 300 96 1
5 83 2014 Honda Accord 112 1
7 65 2009 Honda Fit 106 2
10 31 2007 Honda Fit 102 3
1 128 2010 Infiniti G37 128 1
2 124 2015 Jeep Wrangler 124 1
8 91 2010 Mitsu Lancer 102 1
12 126 2010 Volkswagen Eos 92 1
The above will handle cases where multiple cars of 1 make have the same total by always just printing the top 3 rows for that make, e.g. gven this input where 4 Acuras all have 116 total:
$ cat file
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
4 113 2012 Acura Foo 116
4 113 2012 Acura Bar 116
4 113 2012 Acura Other 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
this is the output showing just 3 of those 4 116 Acuras:
$ awk -f tst.awk file
Ranking ID Year Make Model Total
4 113 2012 Acura Tsx sportwagon 116 1
4 113 2012 Acura Foo 116 2
4 113 2012 Acura Bar 116 3
3 15 014 Audi S4 120 1
11 216 2007 Chrystler 300 96 1
5 83 2014 Honda Accord 112 1
7 65 2009 Honda Fit 106 2
10 31 2007 Honda Fit 102 3
1 128 2010 Infiniti G37 128 1
2 124 2015 Jeep Wrangler 124 1
8 91 2010 Mitsu Lancer 102 1
12 126 2010 Volkswagen Eos 92 1
If that's not what you want then move the if ( ++cnt <= 3 )
test to the outer loop or handle it however else you want.
Upvotes: 2
Reputation: 29025
The following assumes bash (if you don't use bash replace $'\t'
by a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by Make
column first, then numerically in decreasing order by Total
, and finally keep at most the first 3 of each Make
entries.
Sorting is a job for sort
, head
and tail
can be used to isolate the header line, and awk
can be used to keep maximum 3 of each Make
, and re-number the first column:
$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t$'\t' -k4,4 -k6,6rn |
awk -F'\t' -vOFS='\t' '$4==p {n+=1} $4!=p {n=1;p=$4} {$1=++r} n<=3'
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
10 128 2010 Infiniti G37 128
11 124 2015 Jeep Wrangler 124
12 91 2010 Mitsu Lancer 102
13 126 2010 Volkswagen Eos 92
Note that this is different from your expected output: Make
is sorted in alphabetic order (Audi
comes after Acura
, not Honda
) and only the 3 largest Total
are kept (112, 106, 102
for Honda
, not 112, 102, 92
).
If you use GNU awk
, and your input file is small enough to fit in memory, you can also do all this with just awk
, thanks to its multidimensional arrays and its asorti
function, that sorts arrays based on indices:
$ awk -F'\t' -vOFS='\t' 'NR==1 {print; next} {l[$4][$6][$0]}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for(m in l) {
n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
for(i=1; i<=n; i++) for(s in l[m][t[i]]) {$0 = s; $1 = ++r; print}
}
}' data.tsv
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
9 128 2010 Infiniti G37 128
10 124 2015 Jeep Wrangler 124
11 91 2010 Mitsu Lancer 102
12 126 2010 Volkswagen Eos 92
Upvotes: 2