Reputation: 1373
I have a CSV file with 3 columns that looks like this:
a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7
I'd like the output to be like
a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5
i.e., for each element in column a, I'd like to have top 20 (20 highest 'b' values) rows only. Please excuse my poor explanation. I've tried this so far but that doesn't give me my required output:
import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
csvin = csv.reader(fin)
rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
for row in rows_with_mut:
cnt = freqs.setdefault(row[0], [[]] * 20)
heapq.heappushpop(cnt, row)
for assay_id, vals in freqs.iteritems():
output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
writer.writerows(output)
Upvotes: 0
Views: 1049
Reputation: 3532
since the file is only sorted on column a, you have to sort it on column b & c as well. I suggest using natsort, sorting the file in ascending or descending order than looping over it and printing 20 rows for each value of column a.
Something along the lines of:
import natsort
with open('myfile.csv', 'r') as inFile:
lines = inFile.readlines()
sortedList = reversed(natsort.natsorted(lines))
#alternatively, you might want to try natsort.versorted() which is used for version numbers
counter = 0
prevAVal=currentAval=1
for line in sortedList:
currentAVal = ",".split(line)[0]
if currentAVal != prevAval:
counter = 0
if counter < 20 :
print line
counter = counter + 1
prevAVal=currentAVal
Upvotes: 2
Reputation: 476709
On the risk of downvoting, you could use a simple bash script:
#!/bin/bash
all=$(cat) #read from stdin
echo "$all" | head -n 1 #echo the header of the file
allt=$(echo "$all" | tail -n +2) #remove the header from memory
avl=$(echo "$allt" | cut -d ',' -f 1 | sort | uniq) #find all unique values in the a column
for av in $avl #iterate over these values
do
echo "$allt" | grep "^$av," | sort -t$',' -k2nr | head -n 20 #for each value, find all lines with that value and sort them, return the top 20...
done
You can run this in the command line with:
bash script.sh < data.csv
It will print the result on the terminal...
Example:
If one uses your sample values (without the "dot"-rows), one obtains:
user@machine ~> bash script.sh < data.csv
a,b,c
1,5,7
1,3,5
1,1,2
2,4,7
2,3,4
2,1,5
If you want to write the result to a file (say data2.csv
) use:
bash script.sh < data.csv > data2.csv
Don't read and write to the same file: don't run bash script.sh < data.csv > data.csv
.
Upvotes: 1