Reputation: 63994
I have a space-delimited tabular file that looks like this:
>NODE 28 length 23 cov 11.043478 ACATCCCGTTACGGTGAGCCGAAAGACCTTATGTATTTTGTGG
>NODE 32 length 21 cov 13.857142 ACAGATGTCATGAAGAGGGCATAGGCGTTATCCTTGACTGG
>NODE 33 length 28 cov 14.035714 TAGGCGTTATCCTTGACTGGGTTCCTGCCCACTTCCCGAAGGACGCAC
How can I use Unix sort to sort it by the length of the DNA sequence [ATCG]?
Upvotes: 4
Views: 1393
Reputation: 28000
With Perl:
perl -e'
print sort {
length +($a =~ /(\S+)$/)[0]
<=>
length +($b =~ /(\S+)$/)[0]
} <>' infile
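With GNU awk (this relies on the WHINY_USERS environment variable, which makes gawk versions before 4.0 traverse arrays in sorted index order; the record number is part of the key so that lines with equal sequence length do not overwrite each other):
WHINY_USERS= gawk 'END {
for (L in l) print l[L]
}
{
l[sprintf("%15s%15s", length($NF), NR)] = $0
}' infile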
Upvotes: 1
Reputation: 342333
awk '{print length($NF) $0|"sort -n"}' file | sed 's/^.[^>]*>/>/'
Upvotes: 1
Reputation: 9709
This pipeline also works out the length itself. My Unix is a bit rusty; I have been doing other things for a while.
$ awk '{printf("%d %s\n", length($NF), $0)}' junk.lst|sort -n -k1,1|sed 's/^[0-9]* //'
Upvotes: 3
Reputation: 1700
If the length is in the 4th column, sort -n -k4 should do the trick.
If the answer needs to work out the length itself, then you need a preprocessing step before sort: perhaps a Python script that prints the length of the 7th space-separated column as an extra first or last column.
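A minimal sketch of that preprocessing idea in Python. The function names are made up for illustration, and it assumes the sequence is always the last whitespace-separated field on the line (the 7th column in the sample data). It shows both the "add a length column" step and an alternative that skips sort entirely and orders the lines in Python.

```python
def sequence_length(line):
    """Length of the last whitespace-separated field, assumed to be the sequence."""
    fields = line.split()
    return len(fields[-1]) if fields else 0


def with_length_column(lines):
    """Yield each line prefixed by its sequence length, ready for `sort -n -k1,1`."""
    for line in lines:
        yield f"{sequence_length(line)} {line}"


def sort_by_sequence_length(lines):
    """Alternative: order the lines in Python instead of piping to sort.

    sorted() is stable, so lines with equal sequence length keep their input order.
    """
    return sorted(lines, key=sequence_length)
```

To use it as a filter, something like sys.stdout.writelines(sort_by_sequence_length(sys.stdin)) at the end of the script (after import sys) would read the file from standard input and print it sorted.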
Upvotes: 6