user324810

Reputation: 606

Bash sort tab delimited rows based on specific column with most values delimited by comma

I have rows like so:

rs6605071   chr1:962943 XM_017002478.2  stuff1,stuff2                           morestuff
rs6605071   chr1:962943 XM_017002479.1  stuff1,stuff2,stuff3,stuff4,stuff5      morestuff
rs6605071   chr1:962943 XR_001737138.1  stuff1,stuff2,stuff3                    morestuff
rs6605071   chr1:962943 XR_001737478.1  stuff1,stuff2,stuff3,stuff4             morestuff
rs6605071   chr1:962943 NC_426604.3     stuff1                                  morestuff
rs6605071   chr1:962943 NC_426605.3     stuff1                                  morestuff

I would like to sort my rows by the 4th column for the desired output:

rs6605071   chr1:962943 XM_017002479.1  stuff1,stuff2,stuff3,stuff4,stuff5      morestuff
rs6605071   chr1:962943 XR_001737478.1  stuff1,stuff2,stuff3,stuff4             morestuff
rs6605071   chr1:962943 XM_017002478.2  stuff1,stuff2                           morestuff
rs6605071   chr1:962943 NC_426604.3     stuff1                                  morestuff
rs6605071   chr1:962943 NC_426605.3     stuff1                                  morestuff

What is the best approach to achieve this in bash?

Edit 1: Column 4 should not be sorted alphabetically. It has to be sorted by the number of comma-delimited values it contains, most first.

Thank you in advance

Upvotes: 0

Views: 52

Answers (1)

jasonmclose

Reputation: 1695

This is a bit hacky, but it works. I can't tell what your delimiter is (tabs or spaces), but something like this will work and is fairly easy to adapt:

 # Prefix each row with the number of values in column 4, sort numerically (descending), then strip the prefix.
 awk '{print split($4, a, ","), $0}' asdfasdf.txt | sort -k1,1nr | cut -d' ' -f2-

Now, there has got to be a way to do this entirely in awk, and I'm always in awe of the awk experts who know it so well.

I hope one of them puts together a more elegant command, but for now, this will help in a pinch.
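In the meantime, here is a rough sketch of what an awk-only version might look like. It assumes the file really is tab-delimited (drop `-F'\t'` if your columns are separated by spaces), and it holds every row in memory, so it is only suitable for files that fit there. The function name `sort_by_value_count` is made up for illustration. Rows are ordered with a simple insertion sort, which keeps rows with the same count in their original order.

```shell
# Hypothetical helper: sort tab-delimited rows by how many comma-separated
# values column 4 holds, most values first. Reads a file argument or stdin.
sort_by_value_count() {
    awk -F'\t' '
    {
        # split() returns the number of pieces, or 0 for an empty column.
        count[NR] = split($4, parts, ",")
        line[NR]  = $0
    }
    END {
        for (i = 1; i <= NR; i++) order[i] = i

        # Insertion sort on the row indices, descending by count.
        # Only strictly smaller counts are shifted, so ties stay in input order.
        for (i = 2; i <= NR; i++) {
            cur = order[i]
            j = i - 1
            while (j >= 1 && count[order[j]] < count[cur]) {
                order[j + 1] = order[j]
                j--
            }
            order[j + 1] = cur
        }

        for (i = 1; i <= NR; i++) print line[order[i]]
    }' "$@"
}
```

You would call it as `sort_by_value_count asdfasdf.txt`, or pipe rows into it. The insertion sort is quadratic in the number of rows, so for large files the awk, sort, cut pipeline is probably the better choice.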

Upvotes: 1
