wholock

Reputation: 187

Trying to understand the sort utility in Linux

I have a file named a.csv, which contains:

100008,3
10000,3
100010,5
100010,4
10001,6
100021,7

After running the command sort -k1 -d -t "," a.csv

the result is:

10000,3
100008,3
100010,4
100010,5
10001,6
100021,7

This is unexpected, because 10001 should come before 100010.

I have been trying to understand why this happens for a long time, but couldn't find an answer.

$ sort --version
sort (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

Upvotes: 0

Views: 81

Answers (5)

Gsxr1k

Reputation: 157

Some of the other responses have assumed this is a numeric sort vs dictionary sort problem. It isn't: even for an alphabetical sort on the first field, the output given in the question is incorrect.

The answer

To get the correct sorting, you need to change -k1 to -k1,1:

$ sort -k1,1 -d -t "," a.csv
10000,3
100008,3
10001,6
100010,4
100010,5
100021,7

The reason

The -k option takes two numbers, the start and end fields to sort (i.e. -ks,e where s is the start and e is the end). By default, the end field is the end of the line. Hence, -k1 is the same as not giving the -k option at all. To show this, compare:

$ printf "1,a,1\n2,aa,2\n" | sort -k2 -t,
1,a,1
2,aa,2

with:

$ printf "1~a~1\n2~aa~2\n" | sort -k2 -t~
2~aa~2
1~a~1

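The first sorts a,1 before aa,2, while the second sorts aa~2 before a~1, because in ASCII the comma sorts before the letter a, which in turn sorts before the tilde (this assumes a byte-order comparison, as in the C locale).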
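As a rough check of the comparison above, the two examples can be rerun with the collation pinned to byte order. This is only a sketch: LC_ALL=C and the variable names are assumptions added here, not part of the original commands.

```shell
# Assumption: a POSIX shell; LC_ALL=C forces plain byte-order comparison.

# Key runs from field 2 to the end of the line, so the separator is part of the key.
comma_open=$(printf '1,a,1\n2,aa,2\n' | LC_ALL=C sort -t, -k2)
tilde_open=$(printf '1~a~1\n2~aa~2\n' | LC_ALL=C sort -t'~' -k2)

# Key is field 2 only, so the separator never takes part in the comparison.
comma_closed=$(printf '1,a,1\n2,aa,2\n' | LC_ALL=C sort -t, -k2,2)
tilde_closed=$(printf '1~a~1\n2~aa~2\n' | LC_ALL=C sort -t'~' -k2,2)

printf '%s\n' "comma, -k2:" "$comma_open" "tilde, -k2:" "$tilde_open"
printf '%s\n' "comma, -k2,2:" "$comma_closed" "tilde, -k2,2:" "$tilde_closed"
```

With the open-ended key the two separators give opposite orders; with the key closed at field 2 the order no longer depends on the separator.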

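In the question, the key given by -k1 spans the comma, and -d tells sort to ignore every character that is not a blank or alphanumeric. So 10001,6 is compared as 100016 and 100010,4 as 1000104, and 1000104 sorts first. To get the desired behaviour, therefore, we need to sort on only one field. In your case, that means using 1 as both the start and end field, so you specify -k1,1. If you try the two examples above with -k2,2 instead of -k2, you'll find you get the same (correct) ordering in both cases.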
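The difference can be reproduced on the question's own data. The following is a minimal sketch, assuming a POSIX shell and a sort that supports -d; LC_ALL=C and the variable names are additions made here so the result does not depend on the locale.

```shell
# The six lines of a.csv from the question (inlined so the sketch is self-contained).
data='100008,3
10000,3
100010,5
100010,4
10001,6
100021,7'

# Key = field 1 through end of line; -d drops the comma from the comparison,
# so "10001,6" is compared as "100016" and "100010,4" as "1000104".
whole_line=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1 -d)

# Key = field 1 only; the comma and the second field are not part of the key.
first_field=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1 -d)

printf 'with -k1:\n%s\n\nwith -k1,1:\n%s\n' "$whole_line" "$first_field"
```

The first result should match the unexpected order from the question, and the second the corrected order shown at the top of this answer.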

Many thanks to Eric and Assaf from the coreutils mailing list for pointing this out.

Upvotes: 2

Eric Blake

Reputation: 59

You have not found a bug in sort. Your usage bug is that you used '-k1' ("set the key to the first field through the end of the line") instead of '-k1,1' ("set the key to use only the first field"). If you use GNU sort, the --debug option will show you the difference. The delimiter is included in the key as long as the key extends beyond a single field.
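To see the key extents for yourself, something like the following may help. It is a sketch that assumes a GNU sort new enough to accept --debug; the two-line sample, LC_ALL=C, and the variable names are illustrative choices, not part of the original command.

```shell
# Assumption: GNU sort with --debug support.
# --debug prints each output line followed by underscores marking the bytes
# that belong to the key; diagnostics go to stderr and are discarded here.
sample='10001,6
100010,4'

# Key from field 1 to end of line: the underline should cover the whole line.
debug_open=$(printf '%s\n' "$sample" | LC_ALL=C sort --debug -t, -k1 -d 2>/dev/null)

# Key limited to field 1: the underline should stop before the comma.
debug_closed=$(printf '%s\n' "$sample" | LC_ALL=C sort --debug -t, -k1,1 -d 2>/dev/null)

printf '%s\n\n%s\n' "$debug_open" "$debug_closed"
```

Dropping the 2>/dev/null redirection also shows any warnings sort emits about the options used.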

Upvotes: 2

Richard St-Cyr

Reputation: 995

The sort is alphabetical, not numerical. Replace -d with -n in your option list to sort numerically.

Upvotes: 0

Jonathan.Brink

Reputation: 25383

The -d option is for --dictionary-order:

-d, --dictionary-order consider only blanks and alphanumeric characters

But I think you want to use -n (--numeric-sort) instead:

-n, --numeric-sort compare according to string numerical value

So, change your command to look like this:

sort -k1 -n -t "," a.csv

http://man7.org/linux/man-pages/man1/sort.1.html
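For comparison, here is a sketch of a numeric sort on the question's data. Two things in it are assumptions added here rather than part of the command above: the key is limited to the first field (-k1,1n), and LC_ALL=C is set, because in some locales a comma inside a number may be read as a digit-group separator.

```shell
# The six lines of a.csv from the question (inlined so the sketch is self-contained).
data='100008,3
10000,3
100010,5
100010,4
10001,6
100021,7'

# Compare field 1 as a number; lines with equal keys fall back to a
# whole-line byte comparison, so 100010,4 precedes 100010,5.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$numeric"
```

Note that this orders by numeric value (10000, 10001, 100008, ...), which is a different order from the dictionary sort on the first field.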

Upvotes: 0

Roberto

Reputation: 2185

It sorts alphabetically, not numerically, so "," comes before "0", i.e. more like a dictionary.

Upvotes: 0
