Reputation: 27949

Is there a better, pythonic way to do this?

This is my first python program -

Requirement: Read a file consisting of {adId UserId} in each line. For each adId, print the number of unique userIds.

Here is my code, put together from reading the python docs. Could you give me feedback on how I can write this in more python-ish way?

CODE :

import csv

adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
for row in reader:
    adId = row[0]
    userId = row[1]
    if ( adId in adDict ):
        adDict[adId].add(userId)
    else:
        adDict[adId] = set(userId)

for key, value in adDict.items():
    print (key, ',' , len(value))

Thanks.

Upvotes: 10

Answers (8)

unutbu

Reputation: 880587

Congratulations, your code is very nice. There are a few little tricks you could use to make it shorter/simpler.

There is a nifty object type called defaultdict which is provided by the collections module. Instead of having to check if adDict has an adId key, you can set up a defaultdict which acts like a regular dict, except that it automatically provides you with an empty set() when there is no key. So you can change

if ( adId in adDict ):
    adDict[adId].add(userId)
else:
    adDict[adId] = set(userId)

to simply

adDict[adId].add(userId)

Also, instead of

for row in reader:
    adId = row[0]
    userId = row[1]

you could shorten that to

for adId,userId in reader:

Edit: As Parker kindly points out in the comments,

for key, value in adDict.iteritems():

is the most efficient way to iterate over a dict, if you are going to use both the key and value in the loop. In Python3, you can use

for key, value in adDict.items():

since items() returns an iterator.

#!/usr/bin/env python
import csv
from collections import defaultdict

adDict = defaultdict(set)
reader = csv.reader(open("some.csv"), delimiter=' ')
for adId,userId in reader:
    adDict[adId].add(userId)
for key,value in adDict.iteritems():
    print (key, ',' , len(value))

Upvotes: 18

Tom Leys

Reputation: 19029

There are some great answers in here.

One trick I particularly like is to make my code easier to reuse in future like so

import csv

def parse_my_file(file_name):
     # some existing code goes here
     return aDict

if __name__ == "__main__":
     #this gets executed if this .py file is run directly, rather than imported
     aDict = parse_my_file("some.csv")
     for key, value in adDict.items():
         print (key, ',' , len(value))

Now you can import your csv parser from another module and get programmatic access to aDict.

Upvotes: 3

Alex Martelli

Reputation: 882451

the line of code:

adDict[adId] = set(userId)

is unlikely to do what you want -- it will treat string userId as a sequence of letters, so for example if userId was aleax you'd get a set with four items, just like, say, set(['a', 'l', 'e', 'x']). Later, an .add(userId) when userId is aleax again will add a fifth item, the string 'aleax', because .add (differently from the set initializer, which takes an iterable as its argument) takes a single item as its argument.

To make a set with a single item, use set([userId]) instead.

This is a reasonably frequent bug so I wanted to explain it clearly. That being said, defaultdict as suggested in other answers is clearly the right approach (avoid setdefault, that was never a good design and doesn't have good performance either, as well as being pretty murky).

I would also avoid the kinda-overkill of csv in favor of a simple loop with a .split and .strip on each line...

Upvotes: 10

Daniel Pryden

Reputation: 60997

Since you only have a space-delimited file, I'd do:

from __future__ import with_statement
from collections import defaultdict

ads = defaultdict(set)
with open("some.csv") as f:
    for ad, user in (line.split(" ") for line in f):
        ads[ad].add(user)

for ad in ads:
    print "%s, %s" % (ad, len(ads[ad]))

Upvotes: 3

John Millikin

Reputation: 200926

Instead of:

for row in reader:
    adId = row[0]
    userId = row[1]

Use automatic sequence unpacking:

for (adId, userId) in reader:

In:

if ( adId in adDict ):

You don't need parentheses.

Instead of:

if ( adId in adDict ):
    adDict[adId].add(userId)
else:
    adDict[adId] = set(userId)

Use defaultdict:

from collections import defaultdict
adDict = defaultDict(set)

# ...

adDict[adId].add(userId)

Or, if you're not allowed to use other modules by your professor, use setdefault():

adDict.setdefault(adId, set()).add(userId)

When printing:

for key, value in adDict.items():
    print (key, ',' , len(value))

Using string formatting might be easier to format:

print "%s,%s" % (key, len(value))

Or, if you're using Python 3:

print ("{0},{1}".format (key, len(value)))

Upvotes: 3

gak

Reputation: 32783

Just a few bits and pieces:

For extracting the row list into variables:

adId, userId = row

The if statement does not need braces:

if adId in adDict:

You could use exceptions to handle a missing Key in the dict, but both ways work well, e.g.:

try:
    adDict[adId].add(userId)
except KeyError:
    adDict[adId] = set(userId)

Upvotes: 1

Smashery

Reputation: 59673

The only changes I'd make are extracting multiple elements from the reader at once, and using string formatting for print statements.

import csv

adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
# Can extract multiple elements from a list in the iteration statement:
for adId, userId in reader: 
    if ( adId in adDict ):
        adDict[adId].add(userId)
    else:
        adDict[adId] = set(userId)

for key, value in adDict.items():
    # I believe this gives you more control over how things are formatted:
    print ("%s, %d" % (key, len(value)))

Upvotes: 1

sth

Reputation: 229794

You could shorten the for-loop to this:

for row in reader:
  adDict.setdefault(row[0], set()).add(row[1])

Upvotes: 7

Is there a better, pythonic way to do this?

Answers (8)

Related Questions