ashim

Reputation: 25580

Python: creating a histogram out of a list of dictionaries

I am new to Python and am learning how to do things the right way.

I have a list of dictionaries, d. Each dictionary represents a user and contains information such as user_id, age, etc. The list d can contain several dictionaries that represent the same user (with slightly different information that does not matter for my purposes). I want to create a histogram that shows how many users in d have a given age. How can I do this efficiently?

Edit: I want to emphasise that I need to eliminate duplicates in the list.

Upvotes: 4

Views: 5969

Answers (3)

ugoren

Reputation: 16451

Trying to improve on @senderle's answer, hopefully I understood the problem better.

I assume the list contains dictionaries, where the keys are user IDs, and the data are objects which have the age property:

import collections
# Merge all dictionaries into one uid -> age mapping (I'm sure there's a shorter way)
all_ages = {}
for d1 in d:
    for uid, data in d1.iteritems():
        all_ages[uid] = data.age
# Count distinct users per age
histogram = collections.defaultdict(int)
for uid, age in all_ages.iteritems():
    histogram[age] += 1
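
For what it's worth, the merge step can probably be shortened with a dict comprehension. Here is a minimal Python 3 sketch under the same assumption as above (each dictionary maps a user ID to an object with an age attribute). The User namedtuple and the sample data are made up for illustration, and .items() stands in for Python 2's .iteritems():

```python
import collections

# Hypothetical record type; the answer only assumes an object with an `age` attribute.
User = collections.namedtuple("User", ["name", "age"])

# Made-up sample data: user 1 appears in two dictionaries.
d = [
    {1: User("a", 20), 2: User("b", 21)},
    {1: User("a2", 20), 3: User("c", 21)},
]

# Merge into one uid -> age mapping; a later entry overwrites an earlier one.
all_ages = {uid: data.age for d1 in d for uid, data in d1.items()}

# Count distinct users per age.
histogram = collections.Counter(all_ages.values())
print(histogram)  # Counter({21: 2, 20: 1})
```

Because the mapping is keyed on the user ID, a user who shows up in several dictionaries is counted only once.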

Upvotes: -2

senderle

Reputation: 151157

Well, the classic approach to this problem would be to create a defaultdict:

import collections
histogram = collections.defaultdict(int)

Then iterate over the dictionaries in the list and increment the count for each age (using d_list instead of d as the name of the list of dictionaries):

for d in d_list:
    histogram[d['age']] += 1

But you included additional information that confuses me. You said multiple dicts could represent the same user. Do you want to eliminate those duplicates from the histogram? If that's your question, one approach would be to store the users in a dict, user_records, using (firstname, lastname) tuples as keys. Successive dictionaries representing the same user would then overwrite one another, so only one record per user would be preserved. Then iterate over the values in that dictionary (perhaps using user_records.itervalues()).

This general approach can be modified to use whatever values in each record best identify unique users. If the user_id value is unique per user, then use that as the key instead of (firstname, lastname). But your question suggested (to me) that the user_id wouldn't necessarily be the same for two records that represent the same user.

Once you have eliminated the duplicates, though, there's also a shortcut if you're using Python >= 2.7:

histogram = collections.Counter(d['age'] for d in user_records.itervalues())

Some example code... say we have a record_list:

>>> record_list
[{'lastname': 'Mann', 'age': 23, 'firstname': 'Joe'}, 
 {'lastname': 'Moore', 'age': 23, 'firstname': 'Alex'}, 
 {'lastname': 'Sault', 'age': 33, 'firstname': 'Marie'}, 
 {'lastname': 'Mann', 'age': 23, 'firstname': 'Joe'}]
>>> user_ages = dict(((d['firstname'], d['lastname']), d['age']) for d in record_list)
>>> user_ages
{('Joe', 'Mann'): 23, ('Alex', 'Moore'): 23, ('Marie', 'Sault'): 33}

As you can see, the record_list has a duplicate, but the user_ages dict doesn't. Now getting a count of ages is as simple as running the values through a Counter.

>>> collections.Counter(user_ages.itervalues())
Counter({23: 2, 33: 1})

The same thing can be done with any string or immutable object that can serve as a unique identifier of a particular user.
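
For instance, if user_id does identify a user uniquely, the same idea might look like the sketch below. This is written for Python 3 (so .values() instead of .itervalues()), and the records, including the note field, are made up to match the shape described in the question:

```python
import collections

# Made-up records; user_id 1 appears twice with slightly different information.
record_list = [
    {"user_id": 1, "age": 23, "note": "first copy"},
    {"user_id": 2, "age": 23},
    {"user_id": 3, "age": 33},
    {"user_id": 1, "age": 23, "note": "second copy"},
]

# Keying on user_id keeps one age per user; a later record overwrites an earlier one.
user_ages = {rec["user_id"]: rec["age"] for rec in record_list}

# Count how many distinct users have each age.
histogram = collections.Counter(user_ages.values())
print(histogram)  # Counter({23: 2, 33: 1})
```

Note that when two records for the same user disagree on age, the last one in the list wins. That seems acceptable here, since the question says the differences don't matter.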

Upvotes: 3

jcollado

Reputation: 40424

You could use itertools.groupby to group the dictionaries that have the same age and, after that, just count the items in each group.

For example:

import itertools

l = [{'user_id': 1, 'age': 20},
     {'user_id': 2, 'age': 21},
     {'user_id': 3, 'age': 21},
     {'user_id': 4, 'age': 20},
     {'user_id': 5, 'age': 21},
     {'user_id': 6, 'age': 21},
     ]

def get_age(d):
    return d.get('age')

print [(age, len(list(group)))
       for age, group in itertools.groupby(sorted(l, key=get_age),
                                           key=get_age)]

Example output:

[(20, 2), (21, 4)]

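Note: As pointed out by @Dougal, the list must be sorted by the same key first. itertools.groupby only groups consecutive items with equal keys, so on unsorted input the same age can show up in several separate groups.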
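
To illustrate the sorting requirement, here is a small Python 3 sketch that runs groupby over just the ages from the example list, first in their original order and then sorted. The variable names are only for illustration:

```python
import itertools

# Ages from the example list, in their original (unsorted) order.
ages = [20, 21, 21, 20, 21, 21]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_counts = [(age, len(list(group)))
                   for age, group in itertools.groupby(ages)]

# After sorting, each age forms exactly one group.
sorted_counts = [(age, len(list(group)))
                 for age, group in itertools.groupby(sorted(ages))]

print(unsorted_counts)  # [(20, 1), (21, 2), (20, 1), (21, 2)]
print(sorted_counts)    # [(20, 2), (21, 4)]
```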

Upvotes: 2
