Reputation: 27949
This is my first python program -
Requirement: Read a file consisting of {adId UserId} in each line. For each adId, print the number of unique userIds.
Here is my code, put together from reading the python docs. Could you give me feedback on how I can write this in more python-ish way?
CODE :
import csv
adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
for row in reader:
adId = row[0]
userId = row[1]
if ( adId in adDict ):
adDict[adId].add(userId)
else:
adDict[adId] = set(userId)
for key, value in adDict.items():
print (key, ',' , len(value))
Thanks.
Upvotes: 10
Views: 706
Reputation: 880587
Congratulations, your code is very nice. There are a few little tricks you could use to make it shorter/simpler.
There is a nifty object type called defaultdict which is provided by the collections module. Instead of having to check if adDict has an adId key, you can set up a defaultdict which acts like a regular dict, except that it automatically provides you with an empty set() when there is no key. So you can change
if ( adId in adDict ):
adDict[adId].add(userId)
else:
adDict[adId] = set(userId)
to simply
adDict[adId].add(userId)
Also, instead of
for row in reader:
adId = row[0]
userId = row[1]
you could shorten that to
for adId,userId in reader:
Edit: As Parker kindly points out in the comments,
for key, value in adDict.iteritems():
is the most efficient way to iterate over a dict, if you are going to use both the key and value in the loop. In Python3, you can use
for key, value in adDict.items():
since items() returns an iterator.
#!/usr/bin/env python
import csv
from collections import defaultdict
adDict = defaultdict(set)
reader = csv.reader(open("some.csv"), delimiter=' ')
for adId,userId in reader:
adDict[adId].add(userId)
for key,value in adDict.iteritems():
print (key, ',' , len(value))
Upvotes: 18
Reputation: 19029
There are some great answers in here.
One trick I particularly like is to make my code easier to reuse in future like so
import csv
def parse_my_file(file_name):
# some existing code goes here
return aDict
if __name__ == "__main__":
#this gets executed if this .py file is run directly, rather than imported
aDict = parse_my_file("some.csv")
for key, value in adDict.items():
print (key, ',' , len(value))
Now you can import your csv parser from another module and get programmatic access to aDict.
Upvotes: 3
Reputation: 882451
the line of code:
adDict[adId] = set(userId)
is unlikely to do what you want -- it will treat string userId
as a sequence of letters, so for example if userId
was aleax
you'd get a set with four items, just like, say, set(['a', 'l', 'e', 'x'])
. Later, an .add(userId)
when userId
is aleax
again will add a fifth item, the string 'aleax'
, because .add
(differently from the set initializer, which takes an iterable as its argument) takes a single item as its argument.
To make a set with a single item, use set([userId])
instead.
This is a reasonably frequent bug so I wanted to explain it clearly. That being said, defaultdict
as suggested in other answers is clearly the right approach (avoid setdefault
, that was never a good design and doesn't have good performance either, as well as being pretty murky).
I would also avoid the kinda-overkill of csv
in favor of a simple loop with a .split and .strip on each line...
Upvotes: 10
Reputation: 60997
Since you only have a space-delimited file, I'd do:
from __future__ import with_statement
from collections import defaultdict
ads = defaultdict(set)
with open("some.csv") as f:
for ad, user in (line.split(" ") for line in f):
ads[ad].add(user)
for ad in ads:
print "%s, %s" % (ad, len(ads[ad]))
Upvotes: 3
Reputation: 200926
Instead of:
for row in reader:
adId = row[0]
userId = row[1]
Use automatic sequence unpacking:
for (adId, userId) in reader:
In:
if ( adId in adDict ):
You don't need parentheses.
Instead of:
if ( adId in adDict ):
adDict[adId].add(userId)
else:
adDict[adId] = set(userId)
Use defaultdict
:
from collections import defaultdict
adDict = defaultDict(set)
# ...
adDict[adId].add(userId)
Or, if you're not allowed to use other modules by your professor, use setdefault()
:
adDict.setdefault(adId, set()).add(userId)
When printing:
for key, value in adDict.items():
print (key, ',' , len(value))
Using string formatting might be easier to format:
print "%s,%s" % (key, len(value))
Or, if you're using Python 3:
print ("{0},{1}".format (key, len(value)))
Upvotes: 3
Reputation: 32783
Just a few bits and pieces:
For extracting the row list into variables:
adId, userId = row
The if statement does not need braces:
if adId in adDict:
You could use exceptions to handle a missing Key in the dict, but both ways work well, e.g.:
try:
adDict[adId].add(userId)
except KeyError:
adDict[adId] = set(userId)
Upvotes: 1
Reputation: 59673
The only changes I'd make are extracting multiple elements from the reader at once, and using string formatting for print statements.
import csv
adDict = {}
reader = csv.reader(open("some.csv"), delimiter=' ')
# Can extract multiple elements from a list in the iteration statement:
for adId, userId in reader:
if ( adId in adDict ):
adDict[adId].add(userId)
else:
adDict[adId] = set(userId)
for key, value in adDict.items():
# I believe this gives you more control over how things are formatted:
print ("%s, %d" % (key, len(value)))
Upvotes: 1
Reputation: 229794
You could shorten the for-loop to this:
for row in reader:
adDict.setdefault(row[0], set()).add(row[1])
Upvotes: 7