Evan

Reputation: 1499

Order a column then print a certain row with awk in command line

I have a txt file like this:

ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89

Columns row1 through row3 do not matter.

I have about 8 million rows, and the scores range from 0 to 3. I would like to find the score that marks the cutoff for the top 1%. I was thinking of re-ordering the data by score and then printing roughly the 80,000th line. What do you think would be the best code for this?

Upvotes: 3

Views: 222

Answers (2)

Thor

Reputation: 47099

With GNU coreutils you can do it like this:

sort -k5gr <(tail -n+2 infile) | head -n80KB

You can increase the speed of the above pipeline by first removing columns 2 through 4, like this:

tr -s ' ' < infile | cut -d' ' -f1,5 > outfile

Or taken together (after cut the score is the 2nd column, so the sort key becomes -k2):

sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB

Edit

I noticed that you are only interested in the 80000th line of the result. In that case, as you suggested, replace head with sed -n '80000{p;q}', which prints that line and quits.
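As a rough sketch of the whole idea, the rank can be derived from the line count instead of hard-coding 80000. The function name top_cutoff and its pct argument are made up for illustration; it assumes a one-line header, the score in the last whitespace-separated column, and a sort that supports -g (GNU coreutils does).

```shell
# Hypothetical helper: print the score at the top-`pct` percent cutoff.
# Assumes a one-line header and the score in the last column.
top_cutoff() {
  file=$1
  pct=${2:-1}
  # Number of data rows (header excluded).
  rows=$(( $(wc -l < "$file") - 1 ))
  # Rank of the cutoff row, rounded up, never below 1.
  rank=$(( (rows * pct + 99) / 100 ))
  if [ "$rank" -lt 1 ]; then rank=1; fi
  # Keep only the score, sort descending, print the rank-th line and stop.
  tail -n +2 "$file" | awk '{ print $NF }' | sort -gr | sed -n "${rank}{p;q;}"
}
```

Calling top_cutoff infile 1 would then print the score that the top 1% of rows meet or exceed.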

Explanation

tail:

  • -n+2 - skip header.

sort:

  • -k5 - sort on the 5th column.
  • g and r - modifiers on the key that select general-numeric sort in reverse (descending) order.

head:

  • -n - number of lines to keep. KB is a 1000 multiplier; see info head for others.
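To illustrate what the g modifier buys over plain numeric sort, here is a small sketch on made-up scores in mixed notation. The values are not from the question's file, and LC_ALL=C is set on the assumption that the decimal separator is a period.

```shell
# Scores in mixed notation (made-up values for illustration).
scores='1e-3
0.5
2e1'

# -n reads only the leading decimal number, so the exponent part is ignored.
plain=$(printf '%s\n' "$scores" | LC_ALL=C sort -n)

# -g parses each key as a floating-point number, exponent included.
general=$(printf '%s\n' "$scores" | LC_ALL=C sort -g)

printf 'sort -n: %s\n' "$(printf '%s\n' "$plain" | paste -sd' ' -)"
printf 'sort -g: %s\n' "$(printf '%s\n' "$general" | paste -sd' ' -)"
```

If the scores are always plain decimals such as 0.23, -n would order them the same way and is usually faster; -g is the safer choice when scientific notation may appear.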

Upvotes: 2

Thor

Reputation: 47099

With GNU awk you can traverse an array in sorted order by setting PROCINFO["sorted_in"] to "@val_num_desc", which orders the elements by numeric value, largest first. For example:

parse.awk

# Set sorting method
BEGIN { PROCINFO["sorted_in"]="@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save 1st and 5th columns in g and h hashes respectively
NR>1 { g[NR] = $1; h[NR] = $5 }

# Print values from g and h until ratio is reached
END {
  # NR also counts the header line, so there are NR-1 data rows
  limit = int(0.5 + (NR-1)*ratio_to_keep)
  for(k in h) {
    if(i++ >= limit)
      exit
    print g[k], h[k]
  }
}

Run it like this:

awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile
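If only the cutoff score is needed rather than the full list of top rows, a variation along the following lines might do. It is a sketch under assumptions, not part of the answer above: the function name cutoff_gawk is hypothetical, it requires GNU awk for asort(), it holds every score in memory, and it prints the value with awk's default number formatting (six significant digits).

```shell
# Hypothetical helper: print the score at the top-`ratio` cutoff using gawk only.
cutoff_gawk() {
  gawk -v ratio="${2:-0.01}" '
    NR > 1 { score[NR] = $NF + 0 }   # skip the header, store scores as numbers
    END {
      n = asort(score)               # sort values ascending, reindex as 1..n
      if (n == 0) exit
      rank = int(n * ratio + 0.5)    # round to the nearest whole row
      if (rank < 1) rank = 1
      print score[n - rank + 1]      # the rank-th largest score
    }' "$1"
}
```

For example, cutoff_gawk infile 0.01 would print the score of the row that sits at the 1% mark.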

Upvotes: 0
