abn
abn

Reputation: 1373

Extract top 20 (descending) rows of a CSV file with respect to a column

I have a CSV file with 3 columns that looks like this:

a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7

I'd like the output to be like

a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5

i.e., for each element in column a, I'd like to have top 20 (20 highest 'b' values) rows only. Please excuse my poor explanation. I've tried this so far but that doesn't give me my required output:

import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
    csvin = csv.reader(fin)
    rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
    for row in rows_with_mut:
        cnt = freqs.setdefault(row[0], [[]] * 20)
        heapq.heappushpop(cnt, row)

for assay_id, vals in freqs.iteritems():
    output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
    writer.writerows(output)

Upvotes: 0

Views: 1049

Answers (2)

ventsyv
ventsyv

Reputation: 3532

since the file is only sorted on column a, you have to sort it on column b & c as well. I suggest using natsort, sorting the file in ascending or descending order than looping over it and printing 20 rows for each value of column a.

Something along the lines of:

import natsort

with open('myfile.csv', 'r') as inFile:
    lines = inFile.readlines()
    sortedList = reversed(natsort.natsorted(lines))
    #alternatively, you might want to try natsort.versorted() which is used for version numbers
    counter = 0
    prevAVal=currentAval=1
    for line in sortedList:
        currentAVal = ",".split(line)[0]
        if currentAVal != prevAval:
            counter = 0
        if counter < 20 :
                print line
        counter = counter + 1
        prevAVal=currentAVal

Upvotes: 2

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 476709

On the risk of downvoting, you could use a simple bash script:

#!/bin/bash
all=$(cat) #read from stdin
echo "$all" | head -n 1 #echo the header of the file
allt=$(echo "$all" | tail -n +2) #remove the header from memory
avl=$(echo "$allt" | cut -d ',' -f 1 | sort | uniq) #find all unique values in the a column
for av in $avl #iterate over these values
do
    echo "$allt" | grep "^$av," | sort -t$',' -k2nr | head -n 20 #for each value, find all lines with that value and sort them, return the top 20...
done

You can run this in the command line with:

bash script.sh < data.csv

It will print the result on the terminal...

Example:

If one uses your sample values (without the "dot"-rows), one obtains:

user@machine ~> bash script.sh < data.csv 
a,b,c
1,5,7
1,3,5
1,1,2
2,4,7
2,3,4
2,1,5

If you want to write the result to a file (say data2.csv) use:

bash script.sh < data.csv > data2.csv

Don't read and write to the same file: don't run bash script.sh < data.csv > data.csv.

Upvotes: 1

Related Questions