Reputation: 11
I'm having some problems understanding the behavior of sort with locale set to en_US.UTF-8. Here is my example data:
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000481096 MAGED4B X 52063479 52063359 125 4.02e-29
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000481096 MAGED4B X 52063479 52063359 125 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000479281 MAGED4 X 52190616 52190736 125 3.75e-29
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000479281 MAGED4 X 52190616 52190736 125 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408548 SNORA11D X 52190621 52190736 118 1.30e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408548 SNORA11D X 52190621 52190736 118 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408778 SNORA11E X 52063474 52063359 118 1.30e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408778 SNORA11E X 52063474 52063359 118 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214906 ENST00000408163 SNORA11 15 45215033 45214906 128 5.31e-61
ENST00000623237 CTD-2651B20.8 15 45215033 45214906 ENST00000408163 SNORA11 15 45215033 45214906 128 9.60e-62
ENST00000623237 CTD-2651B20.8 15 45215033 45214915 ENST00000408789 SNORA11 X 54814370 54814486 121 4.28e-32
ENST00000623237 CTD-2651B20.8 15 45215033 45214915 ENST00000408789 SNORA11 X 54814370 54814486 121 7.74e-33
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000408823 SNORA11 X 54927305 54927374 70 2.02e-20
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000408823 SNORA11 32 54927305 54927374 70 3.69e-21
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000469211 TRO X 54927305 54927374 70 2.02e-20
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000469211 TRO X 54927305 54927374 70 2.89e-20
I would now need to sort based on the 7th column (MAGED4B...)...
So, if I run:
cut -f1,7 sortTest.txt | sort -k2,2
I get the expected output:
ENST00000623237 MAGED4
ENST00000623237 MAGED4
ENST00000623237 MAGED4B
ENST00000623237 MAGED4B
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11D
ENST00000623237 SNORA11D
ENST00000623237 SNORA11E
ENST00000623237 SNORA11E
ENST00000623237 TRO
ENST00000623237 TRO
But when I add the column next to the one to be sorted:
cut -f1,7,8 sortTest.txt | sort -k2,2
Results are no longer as expected:
ENST00000623237 MAGED4B X
ENST00000623237 MAGED4B X
ENST00000623237 MAGED4 X
ENST00000623237 MAGED4 X
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 32
ENST00000623237 SNORA11D X
ENST00000623237 SNORA11D X
ENST00000623237 SNORA11E X
ENST00000623237 SNORA11E X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 TRO X
ENST00000623237 TRO X
To make thinks even more weird, when I append not the next, but the 2nd next to the 7th column:
cut -f1,7,9 sortTest.txt | sort -k2,2
The output is again as expected:
ENST00000623237 MAGED4 52190616
ENST00000623237 MAGED4 52190616
ENST00000623237 MAGED4B 52063479
ENST00000623237 MAGED4B 52063479
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11D 52190621
ENST00000623237 SNORA11D 52190621
ENST00000623237 SNORA11E 52063474
ENST00000623237 SNORA11E 52063474
ENST00000623237 TRO 54927305
ENST00000623237 TRO 54927305
I have used the "--debug" parameter (with and w/o -b as well) on all trials to check whether fields may be identified wrongly, but this is not the case...
This "issue" resolves if I set LC_ALL=C, but I would prefer not do it as I'm not sure how this may effect the rest of my pipeline...
Upvotes: 1
Views: 201
Reputation: 241861
Note: In response to a comment by OP, I re-examined this answer. Indeed, if the sort
command were sort -k2,2
, the output would be mysterious, and I cannot reproduce it with Gnu sort. So I suspect that the actual command was -k2,3
(or, equivalently, -k2
), and I'll leave the answer below on that basis.
The en_US.UTF8
locale sorts digits before letters and ignores whitespace. It produces unexpected ordering in your first example because X
comes after E
but not in the second example because all digits come before D
. Looking at it without whitespace (the way the collation sees it) might clarify:
Columns 7 and 8:
ENST00000623237 MAGED4BX
ENST00000623237 MAGED4BX
ENST00000623237 MAGED4X
ENST00000623237 MAGED4X
ENST00000623237 SNORA1115
ENST00000623237 SNORA1115
ENST00000623237 SNORA1132
ENST00000623237 SNORA11DX
ENST00000623237 SNORA11DX
ENST00000623237 SNORA11EX
ENST00000623237 SNORA11EX
ENST00000623237 SNORA11X
ENST00000623237 SNORA11X
ENST00000623237 SNORA11X
ENST00000623237 TROX
ENST00000623237 TROX
Columns 7 and 9:
ENST00000623237 MAGED452190616
ENST00000623237 MAGED452190616
ENST00000623237 MAGED4B52063479
ENST00000623237 MAGED4B52063479
ENST00000623237 SNORA1145215033
ENST00000623237 SNORA1145215033
ENST00000623237 SNORA1154814370
ENST00000623237 SNORA1154814370
ENST00000623237 SNORA1154927305
ENST00000623237 SNORA1154927305
ENST00000623237 SNORA11D52190621
ENST00000623237 SNORA11D52190621
ENST00000623237 SNORA11E52063474
ENST00000623237 SNORA11E52063474
ENST00000623237 TRO54927305
ENST00000623237 TRO54927305
You can set an environment variable locally for a single c command by putting the setting at the beginning of the command:
... | LC_COLLATE=C sort ... | ...
So you don't have to worry about the setting affecting other commands.
I alias sort
to LC_COLLATE=C sort
in my bash startup file, because the default debian/ubuntu collation order is useless (at best) for sorting, and also creates a significant slowdown.
A longer answer about the locale issue is here.
You can avoid this particular locale issue by sorting adjacent columns as separate keys (sort -k2,2 -k3,3
instead of sort -k2,3
), but unless locale-based sorting is important to the data, it's faster and less confusing to avoid it.)
Upvotes: 1