ajfbiw.s
ajfbiw.s

Reputation: 401

Differences between Unix commands for Sorting CSV

What's the difference between:

!tail -n +2 hits.csv | sort -k 1n -o output.csv

and

!tail -n +2 hits.csv | sort -t "," -k1 -n -k2 > output.csv

? I'm trying to sort a csv file by first column first, then by the second column, so that lines with the same first column are still together.

It seems like the first one already does that correctly, by first sorting by the field before the first comma, then by the field following the first comma. (breaking ties, that is.)

Or does it not actually do that?

And what does the second command do/mean? (And what's the difference between the two?) There is a significant difference between the two output.csv files when I run the two.

And, finally, which one should I use? (Or are they both wrong?)

Upvotes: 0

Views: 118

Answers (2)

morido
morido

Reputation: 1017

First off: you want to remove the leading ! from these two commands. In Bash (and probably others since this comes from csh) you are otherwise referencing the last command that contained tail in your history which does not make sense here.


The main difference between your two versions is that in the first case you are not taking the second column into account.

This is how I would do it:

tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 > output.csv
  • -t specifies the field separator
  • -n turns on numerical sorting order
  • --key specifies the fields that should be used for sorting (in order of precedence)

Upvotes: 1

twalberg
twalberg

Reputation: 62379

See also the answer by @morido for some other pointers, but here's a description of exactly what those two sort invocations do:

sort -k 1n -o output.csv

This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e. leading whitespace is included in each field, not stripped, as many might expect/assume), and tells sort to order things by a key that starts with the first field and extends to the end of the line, and assumes that the key is formatted as a numeric value. The output is sent explicitly to a specific file.

sort -t "," -k1 -n -k2

This defines the field separator as a comma, and then defines two keys to sort on. The first key again starts at the first field and extends to the end of the line and is lexicographic (dictionary order), not numeric, and the second key, which will be used when values of the first key are identical, starts with the second field and extends to the end of the line, and because of the intervening -n, will be assumed to be numeric data as well. However, because your first key entails the entire line, essentially, the second key is not likely to ever be needed (if the first key of two separate lines is identical, the second key most likely will be too).

Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by @morido:

sort -t, -k1,1 -k2,2

or

sort -t, -k1,1n -k2,2n          (alternatively sort -t, -n -k1,1 -k2,2)

if the data is numeric.

Upvotes: 1

Related Questions