How can I sort csv data alphabetically then numerically by column?

Question

If I have a set of data that has repeating name values but with different variations per repeating value, how can I sort by the top of each of those repeating values? Hopefully that made sense, but I hope to demonstrate what I mean further below.

Take for example this set of data in a tab separated csv file

Ranking  ID     Year  Make        Model     Total
1        128    2010  Infiniti    G37       128
2        124    2015  Jeep        Wrangler  124
3        15     014   Audi        S4        120
4        113    2012  Acura       Tsx       sportwagon  116
5        83     2014  Honda       Accord    112
6        112    2008  Acura       TL        110
7        65     2009  Honda       Fit       106
8        91     2010  Mitsu       Lancer    102
9        50     2015  Acura       TLX       102
10       31     2007  Honda       Fit       102
11       216    2007  Chrystler   300       96
12       126    2010  Volkswagen  Eos       92
13       13     2016  Honda       Civic     1.5t        92

If you look in the Make column, you can see names like Acura and Honda repeat, with differences in the Model and Total column. Assume that there's 200 or so rows of this in the csv file. How can I sort the file so that the items are grouped by Make with only three of the highest in value under the Total column being displayed by each Make?

Expected output below

Ranking   ID      Year    Make     Model           Total
1         113     2012    Acura    Tsx sportwagon  116
2         112     2008    Acura    TL              110
3         50      2015    Acura    TLX             106
4         83      2014    Honda    Accord          112
5         31      2007    Honda    Fit             102
6         13      2016    Honda    Civic 1.5t      92
...

Here is my awk code so far, I can't get past this part to even attempt grouping the makes by total column

BEGIN {
        FS = OFS = "	";
}
FNR == 1 {
        print;
        next;
}
FNR > 1 {
        a[NR] = $4;
}
END {
        PROCINFO["sorted_in"] = "@val_str_desc"
        for(i = 1; i < FN-1; i++) {
                print a[i];
        }
}

Currently, my code reads the text file, prints the headers (column titles) and then stops there, it doesn't go on to print out the rest of the data in alphabetical order. Any ideas?

Renaud Pacalet · Accepted Answer

The following assumes bash (if you don't use bash replace $' ' by a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by Make column first, then numerically in decreasing order by Total, and finally keep at most the first 3 of each Make entries.

Sorting is a job for sort, head and tail can be used to isolate the header line, and awk can be used to keep maximum 3 of each Make, and re-number the first column:

$ head -n1 data.tsv; tail -n+2 data.tsv | sort -t$'	' -k4,4 -k6,6rn |
  awk -F'	' -vOFS='	' '$4==p {n+=1} $4!=p {n=1;p=$4} {$1=++r} n<=3'
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
10       128  2010  Infiniti    G37             128
11       124  2015  Jeep        Wrangler        124
12       91   2010  Mitsu       Lancer          102
13       126  2010  Volkswagen  Eos             92

Note that this is different from your expected output: Make is sorted in alphabetic order (Audi comes after Acura, not Honda) and only the 3 largest Total are kept (112, 106, 102 for Honda, not 112, 102, 92).

If you use GNU awk, and your input file is small enough to fit in memory, you can also do all this with just awk, thanks to its multidimensional arrays and its asorti function, that sorts arrays based on indices:

$ awk -F'	' -vOFS='	' 'NR==1 {print; next} {l[$4][$6][$0]}
  END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for(m in l) {
      n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
      for(i=1; i<=n; i++) for(s in l[m][t[i]]) {$0 = s; $1 = ++r; print}
    }
  }' data.tsv
Ranking  ID   Year  Make        Model           Total
1        113  2012  Acura       Tsx sportwagon  116
2        112  2008  Acura       TL              110
3        50   2015  Acura       TLX             102
4        15   014   Audi        S4              120
5        216  2007  Chrystler   300             96
6        83   2014  Honda       Accord          112
7        65   2009  Honda       Fit             106
8        31   2007  Honda       Fit             102
9        128  2010  Infiniti    G37             128
10       124  2015  Jeep        Wrangler        124
11       91   2010  Mitsu       Lancer          102
12       126  2010  Volkswagen  Eos             92

How can I sort csv data alphabetically then numerically by column?

Answers (2)

Related Questions