nikiforosb

Reputation: 23

How to get unique values from a csv file

I have this csv file

Cat, and, dog, bites
Yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring
Cat, killer, likely, is, a, big, dog
Professional, free, advice, on, dog, training, puppy, training
Cat, and, kitten, training, and, behavior
Dog, &, Cat, provides, dog, training, in Eugene, Oregon
Dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male-female, relationship
Shop, for, your, show, dog, grooming, and, pet, supplies

I want to lowercase every word and build a list containing all the unique items from the CSV file above. Do you have any idea how to do this? Thanks in advance! So far, I have managed to lowercase the words in each row:

unique_row_items = set([field.strip().lower() for field in row])

But I can't manage the other part. This is my attempt:

def unique():

    rows = list(csv.reader(open('example_1.csv', 'r'), delimiter=','))

    result = []

    for r in rows:
        key = r
        if key not in result:
            result.append(r)
    return result

This does not give the result I want.

Upvotes: 0

Views: 14028

Answers (2)

abarnert

Reputation: 365657

If you can't figure out how to do everything at once, do it step by step.

So, let's write an explicit for statement over the rows:

import csv

result = []
# use `with` so the file gets closed
with open('example_1.csv', 'r') as f:
    # no need for `list` here
    rows = csv.reader(f, delimiter=',')
    for row in rows:
        # no need for `set([...])`, just `set(...)`
        unique_row_items = set(field.strip().lower() for field in row)
        for item in unique_row_items:
            if item not in result:
                result.append(item)

But if you look at this, you're trying to use a list as a set; it'll be easier (and more efficient) if you just use a set as a set; then you don't need the if … in check:

result = set()
with open('example_1.csv', 'r') as f:
    # no need for `list` here
    rows = csv.reader(f, delimiter=',')
    for row in rows:
        unique_row_items = set(field.strip().lower() for field in row)
        for item in unique_row_items:
            result.add(item)

And now, adding each element from one set to another is just unioning the sets, so you can replace those last two lines with, e.g.:

result |= unique_row_items
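Put together, the loop with the in-place union looks roughly like this. It is only a minimal sketch: the function name unique_words is hypothetical, and io.StringIO stands in for the opened file so the snippet runs without example_1.csv on disk.

```python
import csv
import io


def unique_words(f):
    """Collect the distinct lowercased fields from an open CSV file object."""
    result = set()
    for row in csv.reader(f, delimiter=','):
        unique_row_items = set(field.strip().lower() for field in row)
        # |= is the in-place union; result.update(unique_row_items) is equivalent
        result |= unique_row_items
    return result


# Stand-in data modelled on the question's file (not the real file)
sample = io.StringIO("Cat, and, dog, bites\nDog, &, Cat, provides\n")
words = unique_words(sample)
```

With a real file you would call it as unique_words(f) inside the same with open(...) block shown above.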

And now, if you want to turn it all back into one big expression, you can:

with open('example_1.csv', 'r') as f:
    result = set.union(*(set(field.strip().lower() for field in row)
                         for row in csv.reader(f, delimiter=',')))

Also, in Python 2.7+, you can just use a set comprehension, instead of calling set on a list comprehension or generator expression:

with open('example_1.csv', 'r') as f:
    result = set.union(*({field.strip().lower() for field in row}
                         for row in csv.reader(f, delimiter=',')))

In fact, you can even turn the whole thing into one big comprehension with a nested loop:

with open('example_1.csv', 'r') as f:
    result = {field.strip().lower() 
              for row in csv.reader(f, delimiter=',')
              for field in row}

Or, alternatively, you don't have to make it one big expression:

with open('example_1.csv', 'r') as f:
    rows = csv.reader(f, delimiter=',')
    rowsets = ({field.strip().lower() for field in row} for row in rows)
    result = set.union(*rowsets)
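One caveat with the set.union(*...) forms: if the file has no rows at all, set.union ends up being called with no arguments and raises a TypeError. A possible workaround, sketched here under the assumption that an empty result is what you want for an empty file, is to call union on an empty set instead. The helper name union_rows is hypothetical, and io.StringIO stands in for the file.

```python
import csv
import io


def union_rows(f):
    """Union the per-row sets of lowercased fields; safe for empty input."""
    rows = csv.reader(f, delimiter=',')
    rowsets = ({field.strip().lower() for field in row} for row in rows)
    # set().union(*rowsets) returns an empty set when there are no rows,
    # whereas set.union(*rowsets) raises TypeError in that case.
    return set().union(*rowsets)


empty = union_rows(io.StringIO(""))
two_rows = union_rows(io.StringIO("Cat, and\nDog, and\n"))
```

The nested set comprehension shown earlier does not have this problem, since it simply produces an empty set.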

Also, as Padraic Cunningham pointed out, one of the dialect options the csv module offers is skipinitialspace, which does just what it sounds like: it ignores whitespace immediately following the delimiter, so you don't need the strip anymore. For example, using the big set comprehension:

with open('example_1.csv', 'r') as f:
    result = {field.lower() 
              for row in csv.reader(f, delimiter=',', skipinitialspace=True)
              for field in row}

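It looks like your format is really using comma-space rather than comma as a delimiter. The csv module only accepts a 1-character delimiter, though, so passing delimiter=', ' to csv.reader raises a TypeError, and skipinitialspace=True is the way to handle it there. Alternatively, since this file has no quoting, you can split each line on comma-space yourself:

with open('example_1.csv', 'r') as f:
    result = {field.lower()
              for line in f
              for field in line.rstrip('\n').split(', ')}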
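Since the question asks for a list rather than a set, here is one way to get a list of unique words in first-seen order. It is a sketch that assumes Python 3.7+, where dict preserves insertion order; the function name unique_in_order is hypothetical, and io.StringIO stands in for the file.

```python
import csv
import io


def unique_in_order(f):
    """Return distinct lowercased fields as a list, in first-seen order."""
    seen = {}
    for row in csv.reader(f, skipinitialspace=True):
        for field in row:
            # dict keys are unique and keep insertion order (Python 3.7+)
            seen.setdefault(field.lower(), None)
    return list(seen)


# Stand-in data modelled on the question's file (not the real file)
sample = io.StringIO("Cat, and, dog, bites\nDog, &, Cat\n")
ordered = unique_in_order(sample)
```

If the order does not matter, list(result) or sorted(result) on any of the sets built above is simpler.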

Upvotes: 7

ZdaR

Reputation: 22954

To store all the words in lowercase, call the .lower() method on each string. After collecting all the words in a list, build a set from it, which keeps only the unique values.

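with open("data_file.csv", "r") as data_file:
    all_words = []
    # iterate over the file directly; readlines() is not needed
    for line in data_file:
        for word in line.split(","):
            # strip() removes the space after each comma and the trailing newline
            all_words.append(word.strip().lower())

unique_words = set(all_words)
print(unique_words)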

Upvotes: 2
