Extract top 20 (descending) rows of a CSV file with respect to a column

Question

I have a CSV file with 3 columns that looks like this:

a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7

I'd like the output to be like

a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5

i.e., for each element in column a, I'd like to have top 20 (20 highest 'b' values) rows only. Please excuse my poor explanation. I've tried this so far but that doesn't give me my required output:

import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='
', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
    csvin = csv.reader(fin)
    rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
    for row in rows_with_mut:
        cnt = freqs.setdefault(row[0], [[]] * 20)
        heapq.heappushpop(cnt, row)

for assay_id, vals in freqs.iteritems():
    output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
    writer.writerows(output)

willeM_ Van Onsem · Accepted Answer

On the risk of downvoting, you could use a simple bash script:

#!/bin/bash
all=$(cat) #read from stdin
echo "$all" | head -n 1 #echo the header of the file
allt=$(echo "$all" | tail -n +2) #remove the header from memory
avl=$(echo "$allt" | cut -d ',' -f 1 | sort | uniq) #find all unique values in the a column
for av in $avl #iterate over these values
do
    echo "$allt" | grep "^$av," | sort -t$',' -k2nr | head -n 20 #for each value, find all lines with that value and sort them, return the top 20...
done

You can run this in the command line with:

bash script.sh < data.csv

It will print the result on the terminal...

Example:

If one uses your sample values (without the "dot"-rows), one obtains:

user@machine ~> bash script.sh < data.csv 
a,b,c
1,5,7
1,3,5
1,1,2
2,4,7
2,3,4
2,1,5

If you want to write the result to a file (say data2.csv) use:

bash script.sh < data.csv > data2.csv

Don't read and write to the same file: don't run bash script.sh < data.csv > data.csv.

Extract top 20 (descending) rows of a CSV file with respect to a column

Answers (2)

Related Questions