Reputation: 1499
I have a txt file like this:
ID row1 row2 row3 score
rs16 ... ... ... 0.23
rs52 ... ... ... 1.43
rs87 ... ... ... 0.45
rs89 ... ... ... 2.34
rs67 ... ... ... 1.89
Columns row1 to row3 do not matter.
I have about 8 million rows, and the scores range from 0 to 3. I would like to find the score that corresponds to being in the top 1%. I was thinking of re-ordering the data by score and then printing the ~80,000th line. What do you think would be the best code for this?
Upvotes: 3
Views: 222
Reputation: 47099
With GNU coreutils you can do it like this:
sort -k5gr <(tail -n+2 infile) | head -n80KB
You can increase the speed of the above pipeline by removing columns 2 through 4 like this:
tr -s ' ' < infile | cut -d' ' -f1,5 > outfile
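As a small illustration of what that command does, here is a sketch on a made-up two-row sample (the sample rows and the `out` variable are assumptions for the demo, not part of the real data). It shows why `tr -s ' '` comes first: `cut` treats every single space as a delimiter, so runs of spaces have to be squeezed before field 5 lines up with the score.

```shell
# Hypothetical sample with irregular spacing between columns.
out=$(printf 'ID row1 row2 row3 score\nrs16 a  b   c 0.23\nrs52 a b c    1.43\n' |
  tr -s ' ' |          # squeeze runs of spaces into one space
  cut -d' ' -f1,5)     # keep only the ID and score columns
printf '%s\n' "$out"
```

On this sample the result should contain only the ID and score of each row, separated by a single space.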
Or taken together:
sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB
If you are only interested in the 80000th line of the result, then using sed -n '80000{p;q}' instead of head, as you suggested, is the way to go.
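Putting the pieces together, the following is a minimal sketch that computes the line number from the row count instead of hard-coding 80000. The generated sample file, the variable names (`infile`, `rows`, `n`, `threshold`) and the choice to round the 1% position down are assumptions made for the illustration; point `infile` at the real file in practice.

```shell
#!/bin/sh
# Hypothetical 200-row sample standing in for the real 8-million-row file.
infile=$(mktemp)
{
  echo "ID row1 row2 row3 score"
  i=1
  while [ "$i" -le 200 ]; do
    # Scores run from 0.01 to 2.00.
    printf 'rs%d a b c %d.%02d\n' "$i" $((i / 100)) $((i % 100))
    i=$((i + 1))
  done
} > "$infile"

# Number of data rows, header excluded.
rows=$(( $(wc -l < "$infile") - 1 ))

# Position of the top-1% cutoff in the sorted data (rounded down, at least 1).
n=$(( rows / 100 ))
[ "$n" -lt 1 ] && n=1

# Sort by score in descending order, stop at line n, keep only the score.
threshold=$(tail -n +2 "$infile" | sort -k5,5gr | sed -n "${n}{p;q;}" | awk '{ print $5 }')
echo "top 1% cutoff: $threshold"

rm -f "$infile"
```

With 8 million rows, `n` works out to 80000, which matches the line number used above.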
Explanation:
tail: -n+2 skips the header.
sort: -k5 sorts on the 5th column; the g and r flags make sort use a reverse general-numeric sort.
head: -n is the number of lines to keep; KB is a 1000 multiplier, see info head for others.
Upvotes: 2
Reputation: 47099
With GNU awk you can sort the values by setting PROCINFO["sorted_in"] to "@val_num_desc". For example like this:
parse.awk
# Set sorting method
BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save 1st and 5th columns in g and h hashes respectively
NR > 1 { g[NR] = $1; h[NR] = $5 }

# Print values from g and h until ratio is reached
END {
  for (k in h) {
    if (i++ >= int(0.5 + NR * ratio_to_keep))
      exit
    print g[k], h[k]
  }
}
Run it like this:
awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile
Upvotes: 0