Reputation: 85
Given a large dataset of one million records, I am looking for a way to do a group by. I am new to Python, but I know that SQL has a GROUP BY clause, and I am guessing something similar is applicable here.
What I want to achieve is this:
From
["A", 4]
["B", 4]
["F", 3]
["A", 4]
["B", 1]
To
["A", (4,4)]
["B", (1,4)]
["F", (3)]
I am also looking for an efficient way to calculate the average of the list of ratings. So finally the output should be:
["A", 4]
["B", 2.5]
["F", 3]
I've tried an iterative approach, but it raised a "too many values to unpack" error. Here is my attempt, which is not working for the dataset:
len = max(key for (item, key) in results)
newList = [[] for i in range(len+1)]
for item, key in results:
    newList[key].append(item)
I am looking for an efficient way to do this. Is there a way to do a group by with a list comprehension? Thanks!
Upvotes: 0
Views: 453
Reputation: 7806
You may want to get used to working with a spreadsheet-style interface for data like this. It's a larger implementation than you asked for, but graphing and charting will be easier in the long run. This example uses pandas and numpy.
Highlight the data from this question and copy it:
name value
A 4
B 4
F 3
A 4
B 1
You can open IPython and start typing the setup for this operation.
import pandas as pd
import numpy as np

data = pd.read_clipboard()
Now comes the interesting part. You can use a pivot table, which can group all those values with any aggregation function you want.
pd.pivot_table(data=data, index='name', aggfunc=np.mean)
which returns
value
name
A 4.0
B 2.5
F 3.0
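If you don't need the full pivot-table machinery, pandas' own groupby does the same aggregation; a minimal sketch, assuming data was loaded from the clipboard as above:

# Group rows by 'name' and take the mean of each group's 'value'
data.groupby('name')['value'].mean()

This gives the same per-name means as the pivot table.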
Upvotes: 0
Reputation: 5560
There is indeed a groupby function in itertools; just be aware that it requires the data to be sorted beforehand. See the documentation here: https://docs.python.org/2/library/itertools.html#itertools.groupby
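To see why the sorting matters: groupby starts a new group every time the key changes, so on unsorted data the same key can appear in several groups. A quick illustration with the data from the question:

>>> import itertools
>>> ls = [["A", 4], ["B", 4], ["F", 3], ["A", 4], ["B", 1]]
>>> [k for k, g in itertools.groupby(ls, key=lambda i: i[0])]
['A', 'B', 'F', 'A', 'B']
>>> [k for k, g in itertools.groupby(sorted(ls), key=lambda i: i[0])]
['A', 'B', 'F']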
But from the code you posted, it looks like you don't really need to group, you just want to count, right? Then you may be better off using collections.Counter. Note that it requires the items to be hashable, so you'd want to convert those lists into tuples:
>>> import collections
>>> lst = [tuple(i) for i in ls]
>>> collections.Counter(lst)
Counter({('A', 4): 2, ('F', 3): 1, ('B', 1): 1, ('B', 4): 1})
Regarding efficiency... you're not likely to fare very well loading the whole dataset into memory, but you could use the defaultdict approach described by Vlad with an iterator.
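A minimal sketch of that idea, assuming (hypothetically) that the records stream in from a ratings.csv file with one name,rating pair per row, so only a running count and sum per key ever sit in memory:

import csv
from collections import defaultdict

totals = defaultdict(lambda: (0, 0))  # name -> (count, running sum)
with open('ratings.csv') as f:        # hypothetical input file
    for name, rating in csv.reader(f):
        cnt, total = totals[name]
        totals[name] = (cnt + 1, total + int(rating))

averages = {name: float(total) / cnt for name, (cnt, total) in totals.items()}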
About the averages, if you really want to use groupby then you could do something like this:
>>> def average(lst):
...     return 1.0*sum(lst)/len(lst) if lst else 0.0
...
>>> [(i[0], average([j[1] for j in i[1]])) for i in itertools.groupby(sorted(ls), key=lambda i: i[0])]
[('A', 4.0), ('B', 2.5), ('F', 3.0)]
Upvotes: 2
Reputation: 18633
I think the following would be a reasonable approach for a small dataset.
from collections import defaultdict
ls = [
    ["A", 4],
    ["B", 4],
    ["F", 3],
    ["A", 4],
    ["B", 1],
]

d = defaultdict(list)
for key, val in ls:
    d[key].append(val)

# Prints [['A', (4, 4)], ['B', (4, 1)], ['F', (3,)]]
print [[k, tuple(l)] for k, l in d.items()]

# Prints [['A', 4.0], ['B', 2.5], ['F', 3.0]]
print [[k, float(sum(l))/len(l)] for k, l in d.items()] #*
*In Python 2.x, use iteritems() instead of items(); see this answer.
Slightly better: if you only care about the averages, you don't need to store everything that a key maps to:
d = defaultdict(lambda: (0, 0))
for key, val in ls:
    cnt, total = d[key]
    d[key] = (cnt + 1, total + val)

print [[k, float(total) / cnt] for k, (cnt, total) in d.items()]
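For the million-record case in the question, this second version is the one to prefer: it makes a single pass and stores only two numbers per key, so memory grows with the number of distinct keys rather than with the number of records.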
Upvotes: 3