Reputation: 1

Unix random sort only on an internal key / column

How would you perform a Unix sort on only an internal column, leaving the lines within each group in random order?

The following command seems reasonable, but the randomization from the first sort is unexpectedly lost: it produces the same output every time it is run.

$ sort --random-sort test.txt | sort --key=2,2
1 a 2
2 a 1
1 b 2
2 b 1

By the way, my eventual goal is to create stratified random samples, which first requires shuffling and then grouping.

Upvotes: 0

Answers (1)

John1024

Reputation: 113934

If you want some randomness to remain, you need to add the --stable option to the second sort:

$ sort --random-sort test.txt | sort --key=2,2 --stable
2 a 1
1 a 2
1 b 2
2 b 1
$ sort --random-sort test.txt | sort --key=2,2 --stable
1 a 2
2 a 1
1 b 2
2 b 1
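Because the runs above are random, the effect of --stable is easier to check with a fixed input. The following is a minimal sketch, assuming a sort that supports -s and -k (GNU sort does); the file names tie.txt, default.txt and stable.txt are made up for illustration.

```shell
# Two lines that share the same key in column 2, deliberately in "reverse" order.
printf '2 a 1\n1 a 2\n' > tie.txt

# Default: a tie on the key is broken by comparing the whole lines,
# so "1 a 2" is moved ahead of "2 a 1".
sort -k2,2 tie.txt > default.txt

# With -s (--stable): a tie on the key keeps the input order,
# so "2 a 1" stays first.
sort -s -k2,2 tie.txt > stable.txt

cat default.txt
cat stable.txt
```

Only the stable variant preserves whatever order the preceding random sort produced.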

This behavior is documented in the GNU sort manual on gnu.org:

A pair of lines is compared as follows: sort compares each pair of fields, in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison.

In other words, in your case, if two lines compare equal under --key=2,2, sort will, by default, fall back to comparing the entire lines, which undoes the random order produced by the first sort. Specifying --stable suppresses this fallback, so those lines keep their original relative order.
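For the stratified sampling mentioned in the question, one possible continuation is to keep the first n lines of each group after the stable sort. This is only a sketch under assumptions: the contents of test.txt are guessed from the output shown in the question, and n, grouped.txt and sample.txt are hypothetical names.

```shell
# Assumed contents of test.txt, reconstructed from the question's output.
printf '%s\n' '1 a 2' '2 a 1' '1 b 2' '2 b 1' > test.txt

# Shuffle, then group by column 2 while keeping the shuffled order inside each group.
sort -R test.txt | sort -s -k2,2 > grouped.txt

# Keep at most n lines per distinct value of column 2 (n=1 here).
# seen[$2] counts how many lines of that group have been read so far.
awk -v n=1 '++seen[$2] <= n' grouped.txt > sample.txt

cat sample.txt
```

Note that sort -R orders lines by a hash of the key, so identical lines end up adjacent; with distinct lines, as here, it behaves like a shuffle.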

Upvotes: 1
