Order a column then print a certain row with awk in command line

Question

I have a txt file like this:

ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89

Columns row1 to row3 do not matter.

I have about 8 million rows, and the scores range from 0 to 3. I would like to find the score that marks the top 1%. I was thinking of re-ordering the data by score and then printing the ~80,000th line. What do you think would be the best code for this?

Thor · Accepted Answer

With GNU coreutils you can do it like this:

sort -k5gr <(tail -n+2 infile) | head -n80KB

You can increase the speed of the above pipeline by first removing columns 2 through 4 like this:

tr -s ' ' < infile | cut -d' ' -f1,5 > outfile
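To see what the trimming step produces, here is a minimal sketch run on the five sample rows from the question. The temporary file and the variable names are assumptions made for illustration; the real input would be the 8-million-row file.

```shell
# Hypothetical sample file holding the rows shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89
EOF

# Squeeze runs of spaces into one, then keep only columns 1 and 5.
trimmed=$(tr -s ' ' < "$sample" | cut -d' ' -f1,5)
printf '%s\n' "$trimmed"

rm -f "$sample"
```

Note that after this step the score is the 2nd column, not the 5th, so any later sort has to use key 2.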

Or taken together (after cut the score is the 2nd column, so the sort key becomes -k2):

sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB

Edit

I noticed that you are only interested in the 80000th line of the result. In that case, replacing head with sed -n '80000{p;q}', as you suggested, is the way to go: it prints only that line and then quits.
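The line number does not have to be hard-coded. A minimal sketch, assuming the top-1% boundary is taken as the row at rank ceil(rows / 100) in descending score order: the sample file, the variable names, and the rounding rule are assumptions for illustration, not part of the original answer.

```shell
# Hypothetical sample file standing in for the real 8-million-row input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89
EOF

# Number of data rows (the header line is excluded).
rows=$(( $(wc -l < "$sample") - 1 ))

# Rank of the top-1% boundary, rounded up so it is never 0.
rank=$(( (rows + 99) / 100 ))

# Sort by score, highest first, print only line $rank, then keep its score.
cutoff=$(tail -n +2 "$sample" | sort -k5,5gr | sed -n "${rank}{p;q}" | awk '{print $5}')
echo "rows=$rows rank=$rank cutoff=$cutoff"

rm -f "$sample"
```

With 8 million data rows the same arithmetic gives rank 80000, which matches the line number discussed above.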

Explanation

tail:

-n+2 - skip header.

sort:

-k5 - sort on the 5th column.
g, r - key modifiers that make sort use general-numeric comparison in reverse (descending) order.
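The difference between general-numeric (g) and plain numeric (n) sorting only shows up for values such as scientific notation. A small sketch with made-up scores, assuming the C locale so that "." is the decimal point:

```shell
# Force the C locale so "." is read as the decimal point.
export LC_ALL=C

# Hypothetical scores; 1e-3 is 0.001 written in scientific notation.
general=$(printf '%s\n' 0.5 1e-3 2 | sort -gr | paste -sd' ' -)
plain=$(printf '%s\n' 0.5 1e-3 2 | sort -nr | paste -sd' ' -)

echo "sort -gr: $general"
echo "sort -nr: $plain"
```

With -g, 1e-3 is parsed as a floating-point number and ends up last; with -n only the leading "1" is read, so it sorts above 0.5. For scores written as plain decimals, as in the question, both give the same order.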

head:

-n - number of lines to keep. KB is a multiplier of 1000, so 80KB means 80000 lines; see info head for others.
