Reputation: 33
I have a list that contains multiple columns, and I need to group and sum rows based on two columns. Can I do this without using a Pandas dataframe?
I have a dataset in a list like this:
User Days Project
Dave 3 Red
Dave 4 Red
Dave 2 Blue
Sue 4 Red
Sue 1 Red
Sue 3 Yellow
Specifically:
[[Dave, 3, Red], [Dave, 4, Red], [Dave, 2, Blue], [Sue, 4, Red], [Sue, 1, Red], [Sue, 3, Yellow]]
What I want to do is output on the same line some totals like this:
User Days Project UserDays ProjectDaysPerUser
Dave 3 Red 9 7
Dave 4 Red 9 7
Dave 2 Blue 9 2
Sue 4 Red 8 5
Sue 1 Red 8 5
Sue 3 Yellow 8 3
So I'm trying to group twice to get the "ProjectDaysPerUser", first by user, then by project. It's this double grouping that's throwing me off.
Is there an easy way to do this without creating a Panda dataframe?
Upvotes: 1
Views: 2838
Reputation: 85
Efficient Code
import itertools
def group_data(input1:list)->list:
name_dict = {k : sum(v[1] for v in g) for k, g in itertools.groupby(sorted(input1, key=lambda x:x[0]), key=lambda x:x[0])}
name_colour_dict = {k: sum(v[1] for v in g) for k,g in itertools.groupby(sorted(input1, key=lambda x:(x[0], x[2])), key=lambda x:(x[0],x[2]))}
for row in input1:
name = row[0]
name_colour = (row[0], row[2])
row.append(name_dict[name])
row.append(name_colour_dict[name_colour])
print(input1)
group_data([['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
)
Upvotes: 0
Reputation: 101
use dictionary for improved performance
data = [['Dave', 3, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Dave', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
sum_dict = {}
for d in data:
sum_dict[d[0]] = sum_dict.get(d[0], 0) + d[1]
sum_dict[(d[0], d[2])] = sum_dict.get((d[0], d[2]), 0) + d[1]
for d in data:
d.append(sum_dict[d[0]])
d.append(sum_dict[(d[0], d[2])])
print(d)
Upvotes: 2
Reputation: 6348
Because you're doing sums, this can also be solved nicely with collections.Counter
:
from collections import Counter
data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
user_days = Counter()
project_user_days = Counter()
for (name, num_days, project) in data:
user_days[name] += num_days
project_user_days[(name, project)] += num_days
derived_data = [
[name, num_days, project, user_days[name], project_user_days[(name, project)]]
for (name, num_days, project) in data
]
import pprint
pprint.pprint(derived_data)
# [['Dave', 3, 'Red', 9, 7],
# ['Dave', 4, 'Red', 9, 7],
# ['Dave', 2, 'Blue', 9, 2],
# ['Sue', 4, 'Red', 8, 5],
# ['Sue', 1, 'Red', 8, 5],
# ['Sue', 3, 'Yellow', 8, 3]]
Upvotes: 0
Reputation: 12704
Below script is using groupby and appending the result of the sum to the list.
from itertools import groupby
data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
new_data, final = [], []
userDays=[[k, sum(v[1] for v in g)] for k, g in groupby(data, key = lambda x: x[0])]
projuserDays=[[k, sum(v[1] for v in g)] for k, g in groupby(data, key = lambda x: (x[0], x[2]))]
#add userDays and projectuserdays
for d in data:
for u in userDays:
if d[0]==u[0]:
d.append(u[1])
new_data.append(d)
for p in projuserDays:
if d[0]==p[0][0] and d[2]==p[0][1]:
d.append(p[1])
final.append(d)
print(final)
Result:
[['Dave', 3, 'Red', 9, 7],
['Dave', 4, 'Red', 9, 7],
['Dave', 2, 'Blue', 9, 2],
['Sue', 4, 'Red', 8, 5],
['Sue', 1, 'Red', 8, 5],
['Sue', 3, 'Yellow', 8, 3]]
Upvotes: 2