abn

Reputation: 1373

Python to extract unique CSV rows

I'm trying to get the first occurrence of each row of a CSV in Python. However, I'm facing an issue. My CSV file looks like this:

1,2,3,a,7,5,y,0
1,2,3,a,3,5,y,8
1,2,3,a,5,3,y,7
1,2,3,d,7,5,n,0
1,2,3,d,3,5,n,8
1,2,3,d,5,3,n,7
2,3,4,f,4,6,y,9
2,3,4,f,5,6,y,9
2,3,4,f,7,3,y,9
2,3,4,e,3,5,n,9
2,3,4,e,0,7,n,9
2,3,4,e,5,8,n,9

I tried the following to get the first occurrence of each unique value in one of the columns.

import csv
from collections import defaultdict

def unique():
    with open('try.csv', 'r') as f:
        rows = list(csv.reader(f, delimiter=','))
    columns = list(zip(*rows))  # list() needed on Python 3, where zip() is a one-shot iterator

    uniq = set(columns[1])
    indexed = defaultdict(list)

    for x in uniq:
        i = columns[1].index(x)  # index of the first occurrence of x
        indexed[i] = rows[i]

    return indexed

It works fine when uniqueness is based on a single column. However,

  1. I'd like to use columns[1] and columns[6] together as the unique key.
  2. The tricky part is that columns[6] is always y or n. If I deduplicate on that column alone, I only get the first y row and the first n row in the whole file. What I want is, for every columns[1] value, the first occurrence of a y row and of an n row. Sorry for my poor description. So basically, I'd like my output to be like:
1,2,3,d,7,5,n,0,a
2,3,4,e,3,5,n,9,f

Upvotes: 1

Views: 5378

Answers (3)

Gilles Bertrand

Reputation: 1

Old topic, but this could be useful for others: why not call the external uniq command if you are in a Unix environment? That way you would not have to reinvent this code and might benefit from better performance. (Note that uniq only collapses adjacent duplicate whole lines, so the input would have to be sorted first.)
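For a compound key like the question's (columns 1 and 6), the closest shell equivalent is actually an awk one-liner rather than uniq, since awk can keep the first line per key without sorting. A sketch of calling it from Python (this assumes a Unix environment with awk on the PATH, and the question's try.csv):

```python
import subprocess

def first_per_key(path):
    # awk keeps the first line for each (field 2, field 7) pair: the seen[]
    # array records keys already printed, so later duplicates fail the
    # !seen[...]++ test and are suppressed. Input order is preserved.
    result = subprocess.run(
        ["awk", "-F,", "!seen[$2 FS $7]++", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

This shells out per call, so for large files or portability a pure-Python solution like the answers below is usually preferable.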

Upvotes: 0

Sylvain Leroux

Reputation: 51990

There is some room for improvement in your code, but I didn't want to rewrite it in depth, as you had it almost right. The key point is that you need a compound key: it is the pair (r[1], r[6]) that has to be unique. In addition, I took the liberty of using an OrderedDict for fast lookup while preserving the row order.

import csv
import collections

def unique():
    with open('try.csv', 'r') as f:
        rows = list(csv.reader(f, delimiter=','))
    result = collections.OrderedDict()
    for r in rows:
        key = (r[1], r[6])  ## The pair (r[1],r[6]) must be unique
        if key not in result:
            result[key] = r

    return list(result.values())  # list() so this also prints cleanly on Python 3

from pprint import pprint
pprint(unique())

Producing:

[['1', '2', '3', 'a', '7', '5', 'y', '0'],
 ['1', '2', '3', 'd', '7', '5', 'n', '0'],
 ['2', '3', '4', 'f', '4', '6', 'y', '9'],
 ['2', '3', '4', 'e', '3', '5', 'n', '9']]
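As a side note going beyond the original answer: on Python 3.7+ a plain dict already preserves insertion order, so the same first-occurrence logic works without OrderedDict. A minimal sketch, assuming the question's try.csv:

```python
import csv

def unique(path='try.csv'):
    """Return the first row for each (r[1], r[6]) pair, in file order."""
    result = {}  # plain dicts preserve insertion order on Python 3.7+
    with open(path, newline='') as f:
        for r in csv.reader(f):
            result.setdefault((r[1], r[6]), r)  # keep only the first row per key
    return list(result.values())
```

setdefault() inserts the row only when the key is not yet present, which is exactly the "first occurrence wins" behaviour of the explicit `if key not in result` check above.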

Upvotes: 3

johntellsall

Reputation: 15160

Here's an alternate implementation.

Each row is read in from the data set. We use a defaultdict(list) to store all rows, keyed by each row's two-column index. As a row is read in, it is appended to the list stored under that row's key.

At the end, we scan through the defaultdict. For each key we want the first row from the dataset, so we yield element [0] of the list stored under that two-column index.

source

import csv
from collections import defaultdict

def unique():
    uniq = defaultdict(list)
    with open('try.csv', 'r') as f:
        for row in csv.reader(f, delimiter=','):
            uniq[(row[1], row[6])].append(row)  # compound key, as in the question

    for idx, rows in uniq.items():  # .iteritems() exists only in Python 2
        yield rows[0]  # first occurrence for this key

print(list(unique()))

output

[['1', '2', '3', 'a', '7', '5', 'y', '0'], ['1', '2', '3', 'd', '7', '5', 'n', '0'], ['2', '3', '4', 'f', '4', '6', 'y', '9'], ['2', '3', '4', 'e', '3', '5', 'n', '9']]

Upvotes: 1
