Reputation: 1373
I'm trying to get the first occurrence of each kind of row in a CSV in Python. However, I'm facing an issue. My CSV file looks like this:
1,2,3,a,7,5,y,0
1,2,3,a,3,5,y,8
1,2,3,a,5,3,y,7
1,2,3,d,7,5,n,0
1,2,3,d,3,5,n,8
1,2,3,d,5,3,n,7
2,3,4,f,4,6,y,9
2,3,4,f,5,6,y,9
2,3,4,f,7,3,y,9
2,3,4,e,3,5,n,9
2,3,4,e,0,7,n,9
2,3,4,e,5,8,n,9
I tried this to get the first occurrence for each unique value of one of the columns:
import csv
from collections import defaultdict

def unique():
    rows = list(csv.reader(open('try.csv', 'r'), delimiter=','))
    columns = zip(*rows)           # transpose: columns[1] is the second column
    uniq = set(columns[1])         # unique values in that column
    indexed = defaultdict(list)
    for x in uniq:
        i = columns[1].index(x)    # index of the first row holding value x
        indexed[i] = rows[i]
    return indexed
It works fine for one unique column value set. However, I can't get it to work when the uniqueness has to be based on a combination of two columns; for example, I end up with only
1,2,3,d,7,5,n,0
2,3,4,e,3,5,n,9
instead of also getting the first rows containing a and f.
Upvotes: 1
Views: 5378
Reputation: 1
Old topic, but it could be useful for others: why not call the external uniq command if you are in a Unix environment? That way you would not have to reinvent this code and would benefit from potentially better performance.
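For what that might look like: plain uniq only collapses adjacent identical lines, so deduplicating on the two key columns is really closer to sort -u with key fields. A minimal sketch, invoked from Python via subprocess (the file name try.csv and the 1-based key columns 2 and 7 are assumptions carried over from the other answers):
import subprocess

# Sketch only: keep one row per (column 2, column 7) pair using sort -u.
# Caveat: sort -u keeps one row per key in sort order, not necessarily
# the first occurrence in the original file order.
output = subprocess.check_output(
    ['sort', '-t,', '-k2,2', '-k7,7', '-u', 'try.csv'],
    universal_newlines=True)
print(output)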
Upvotes: 0
Reputation: 51990
There is some room for improvement in your code, but I didn't want to rewrite it in depth, as you had it almost right. The "key" point is that you need a compound key: it is the pair (r[1], r[6]) that has to be unique. In addition, I took the liberty of using an OrderedDict for fast lookups while preserving the row order.
import csv
import collections

def unique():
    rows = list(csv.reader(open('try.csv', 'r'), delimiter=','))
    result = collections.OrderedDict()
    for r in rows:
        key = (r[1], r[6])   ## The pair (r[1], r[6]) must be unique
        if key not in result:
            result[key] = r
    return result.values()

from pprint import pprint
pprint(unique())
Producing:
[['1', '2', '3', 'a', '7', '5', 'y', '0'],
 ['1', '2', '3', 'd', '7', '5', 'n', '0'],
['2', '3', '4', 'f', '4', '6', 'y', '9'],
 ['2', '3', '4', 'e', '3', '5', 'n', '9']]
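As a follow-up note, if you are on Python 3.7 or later (an assumption; the code above is written for Python 2), a plain dict already preserves insertion order, so the same compound-key idea can be sketched without OrderedDict:
import csv

def unique():
    result = {}
    with open('try.csv', newline='') as f:
        for r in csv.reader(f):
            key = (r[1], r[6])       # same compound key as above
            if key not in result:    # keep only the first occurrence
                result[key] = r
    return list(result.values())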
Upvotes: 3
Reputation: 15160
Here's an alternate implementation.
Each row is read in from the data set. We use a defaultdict(list) to store all rows, keyed by each row's two-column key: as a row is read in, it's appended to the defaultdict under that key.
At the end, we scan through the defaultdict. We want the first row from the dataset that matched each key, so we yield the first element of each list.
import csv
from collections import defaultdict

def unique():
    uniq = defaultdict(list)
    for row in csv.reader(open('try.csv', 'r'), delimiter=','):
        uniq[(row[0], row[6])].append(row)   # group rows by the two-column key
    for idx, rows in uniq.iteritems():
        yield rows[0]                        # first row seen for this key

print list(unique())
[['2', '3', '4', 'f', '4', '6', 'y', '9'], ['2', '3', '4', 'e', '3', '5', 'n', '9'], ['1', '2', '3', 'a', '7', '5', 'y', '0'], ['1', '2', '3', 'd', '7', '5', 'n', '0']]
Upvotes: 1