Reputation: 25580
I am new to Python and am learning how to do things the right way.
I have a list of dictionaries, d. Each dictionary represents a user and contains information such as user_id, age, etc. The list d can contain several dictionaries that represent the same user (with slightly different information that does not matter for my purposes). I want to create a histogram that shows how many users in d have a given age. How can I do this efficiently?
Edit: I want to emphasise that I need to eliminate duplicates in the list.
Upvotes: 4
Views: 5969
Reputation: 16451
Trying to improve on @senderle's answer, hopefully I understood the problem better.
I assume the list contains dictionaries whose keys are user IDs and whose values are objects with an age property:
import collections
# Merge all dictionaries to one uid->age mapping (I'm sure there's a shorter way)
all_ages = {}
for d1 in d:
    for uid, data in d1.iteritems():
        all_ages[uid] = data.age
# Count distinct users per age
histogram = collections.defaultdict(int)
for uid, age in all_ages.iteritems():
    histogram[age] += 1
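As one possible shorter form of the merge step, here is a Python 3 sketch that uses a dict comprehension and collections.Counter. The User namedtuple and the sample data are hypothetical, invented only to match the assumption above (keys are user IDs, values are objects with an age attribute).

```python
import collections

# Hypothetical stand-in for "objects which have the age property".
User = collections.namedtuple("User", ["age"])

# Hypothetical sample data: user 1 appears in two dictionaries.
d = [
    {1: User(age=20), 2: User(age=21)},
    {1: User(age=20), 3: User(age=21)},
]

# Merge all dictionaries into one uid -> age mapping. A later entry for
# the same uid overwrites the earlier one, which removes duplicates.
all_ages = {uid: data.age for d1 in d for uid, data in d1.items()}

# Count distinct users per age.
histogram = collections.Counter(all_ages.values())
print(histogram)  # Counter({21: 2, 20: 1})
```

On Python 2 the same idea works with .iteritems() and .itervalues() in place of .items() and .values().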
Upvotes: -2
Reputation: 151157
Well, the classic approach to this problem would be to create a defaultdict:
import collections
histogram = collections.defaultdict(int)
Then iterate over the dictionaries in the list (using d_list instead of d as the name of the list of dictionaries):
for d in d_list:
    histogram[d['age']] += 1
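To make the counting step concrete, here is a small Python 3 sketch on made-up data (the user_id values and ages are hypothetical). Note that this plain count treats every dictionary as a separate user, so a repeated record is counted twice.

```python
import collections

# Hypothetical sample data; user 1 appears twice.
d_list = [
    {"user_id": 1, "age": 23},
    {"user_id": 2, "age": 23},
    {"user_id": 3, "age": 33},
    {"user_id": 1, "age": 23},
]

histogram = collections.defaultdict(int)
for d in d_list:
    # Ages not seen before start at 0 thanks to defaultdict(int).
    histogram[d["age"]] += 1

print(dict(histogram))  # {23: 3, 33: 1}
```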
But you included additional information that confuses me. You said multiple dicts could represent the same user. Do you want to eliminate those duplicates from the histogram? If that's your question, one approach would be to store the users in a dict, user_records, using (firstname, lastname) tuples as keys. Successive dictionaries representing the same user would then overwrite one another, so only one record per user would be preserved. Then iterate over the values in that dictionary (perhaps using user_records.itervalues()).
This general approach can be modified to use whatever values in each record best identify unique users. If the user_id value is unique per user, then use that as the key instead of (firstname, lastname). But your question suggested (to me) that the user_id wouldn't necessarily be the same for two records that describe the same user.
Once you have eliminated the duplicates, though, there's also a shortcut if you're using Python >= 2.7:
histogram = collections.Counter(d['age'] for d in user_records.itervalues())
Some example code: say we have a record_list:
>>> record_list
[{'lastname': 'Mann', 'age': 23, 'firstname': 'Joe'},
{'lastname': 'Moore', 'age': 23, 'firstname': 'Alex'},
{'lastname': 'Sault', 'age': 33, 'firstname': 'Marie'},
{'lastname': 'Mann', 'age': 23, 'firstname': 'Joe'}]
>>> user_ages = dict(((d['firstname'], d['lastname']), d['age']) for d in record_list)
>>> user_ages
{('Joe', 'Mann'): 23, ('Alex', 'Moore'): 23, ('Marie', 'Sault'): 33}
As you can see, record_list has a duplicate, but the user_ages dict doesn't. Now getting a count of ages is as simple as running the values through a Counter.
>>> collections.Counter(user_ages.itervalues())
Counter({23: 2, 33: 1})
The same thing can be done with any string or immutable object that can serve as a unique identifier of a particular user.
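For instance, if user_id does turn out to be unique per user, a Python 3 sketch of the same idea might look like this. The records and their field values are invented for illustration, and .values() stands in for the Python 2 .itervalues() used above.

```python
import collections

# Hypothetical records; user_id 1 appears twice with slightly different data.
record_list = [
    {"user_id": 1, "age": 23, "city": "Oslo"},
    {"user_id": 2, "age": 23, "city": "Lima"},
    {"user_id": 3, "age": 33, "city": "Kyiv"},
    {"user_id": 1, "age": 23, "city": "Bergen"},
]

# Key by user_id so later records for the same user overwrite earlier ones.
user_ages = {rec["user_id"]: rec["age"] for rec in record_list}

# Count how many distinct users have each age.
histogram = collections.Counter(user_ages.values())
print(histogram)  # Counter({23: 2, 33: 1})
```

This makes one pass to deduplicate and one pass to count, so the work grows linearly with the number of records.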
Upvotes: 3
Reputation: 40424
You could use itertools.groupby to group all the dictionaries that have the same age into lists and then just calculate the length of each list.
For example:
import itertools
l = [{'user_id': 1, 'age': 20},
{'user_id': 2, 'age': 21},
{'user_id': 3, 'age': 21},
{'user_id': 4, 'age': 20},
{'user_id': 5, 'age': 21},
{'user_id': 6, 'age': 21},
]
def get_age(d):
    return d.get('age')

# The parenthesised form prints the same list on Python 2 and Python 3.
print([(age, len(list(group)))
       for age, group in itertools.groupby(sorted(l, key=get_age),
                                           key=get_age)])
Example output:
[(20, 2), (21, 4)]
Note: As pointed out by @Dougal, the list must be sorted by the same key before grouping. Otherwise itertools.groupby won't work as expected, because it only groups adjacent items.
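A short Python 3 sketch of why the sort matters, using a hypothetical list of ages that has already been extracted from the records:

```python
import itertools

ages = [20, 21, 20, 21, 21]  # hypothetical ages, in arrival order

# Without sorting, groupby only merges *adjacent* equal items,
# so the same age can show up in several groups.
unsorted_counts = [(age, len(list(g))) for age, g in itertools.groupby(ages)]
print(unsorted_counts)  # [(20, 1), (21, 1), (20, 1), (21, 2)]

# After sorting, each age forms exactly one group.
sorted_counts = [(age, len(list(g))) for age, g in itertools.groupby(sorted(ages))]
print(sorted_counts)  # [(20, 2), (21, 3)]
```

Note that this approach counts records rather than users, so duplicates would still need to be removed first.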
Upvotes: 2