Reputation: 26258
Given a list
old_list = [obj_1, obj_2, obj_3, ...]
I want to create a list:
new_list = [[obj_1, obj_2], [obj_3], ...]
where obj_1.some_attr == obj_2.some_attr.
I could throw some for loops and if checks together, but this is ugly. Is there a Pythonic way to do this? By the way, the attributes of the objects are all strings.
Alternatively, a solution for a list containing tuples (of the same length) instead of objects would be appreciated, too.
Upvotes: 65
Views: 61831
Reputation: 301
Recently, I faced a similar issue. Thank you for the solutions provided above. I wrote a small comparison of the computation times of the methods mentioned above. In my implementation I keep the dictionary, as it is nice to see the keys as well.
The method with defaultdict won.
from collections import defaultdict
import time
import itertools
import pandas as pd
import random
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"Person(name='{self.name}', age={self.age})"

def method_with_dict(people):
    groups = {}
    for person in people:
        if person.age in groups:
            groups[person.age].append(person)
        else:
            groups[person.age] = [person]
    return groups

def method_with_defaultdict(people):
    groups = defaultdict(list)
    for person in people:
        groups[person.age].append(person)
    return groups

def group_by_age_with_itertools(people):
    # Note: sort() reorders the caller's list in place, so repeated calls
    # on the same list operate on already-sorted data.
    people.sort(key=lambda x: x.age)
    groups = {}
    for age, group in itertools.groupby(people, key=lambda x: x.age):
        groups[age] = list(group)
    return groups

def group_by_age_with_pandas(people):
    df = pd.DataFrame([(p.name, p.age) for p in people], columns=["Name", "Age"])
    groups = df.groupby("Age")["Name"].apply(list).to_dict()
    # Rebuild Person objects from the grouped names.
    return {k: [Person(name, k) for name in v] for k, v in groups.items()}
if __name__ == "__main__":
    num_people = 1000
    min_age, max_age = 18, 80
    people = [Person(name=f"Person {i}", age=random.randint(min_age, max_age)) for i in range(num_people)]
    N = 10000

    start_time = time.time()
    for i in range(N):
        result_defaultdict = method_with_defaultdict(people)
    end_time = time.time()
    print(f"method_with_defaultdict: {end_time - start_time:.6f} seconds")

    start_time = time.time()
    for i in range(N):
        result_dict = method_with_dict(people)
    end_time = time.time()
    print(f"method_with_dict: {end_time - start_time:.6f} seconds")

    start_time = time.time()
    for i in range(N):
        result_itertools = group_by_age_with_itertools(people)
    end_time = time.time()
    print(f"method_with_itertools: {end_time - start_time:.6f} seconds")

    start_time = time.time()
    for i in range(N):
        result_pandas = group_by_age_with_pandas(people)
    end_time = time.time()
    print(f"method_with_pandas: {end_time - start_time:.6f} seconds")
method_with_defaultdict: 0.954309 seconds
method_with_dict: 1.301710 seconds
method_with_itertools: 1.868009 seconds
method_with_pandas: 34.422366 seconds
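For anyone repeating the measurement, here is a rough sketch using the standard timeit module. It is not the code that produced the numbers above: the helper names group_with_defaultdict and group_with_groupby are made up for this example, (name, age) tuples stand in for the Person objects, and the groupby variant uses sorted() so that every call sorts the original unsorted list.

```python
import itertools
import random
import timeit
from collections import defaultdict
from operator import itemgetter

# Hypothetical data: (name, age) tuples stand in for the Person objects.
random.seed(0)
people = [(f"Person {i}", random.randint(18, 80)) for i in range(1000)]
get_age = itemgetter(1)


def group_with_defaultdict(items):
    groups = defaultdict(list)
    for item in items:
        groups[get_age(item)].append(item)
    return dict(groups)


def group_with_groupby(items):
    # sorted() returns a new list, so the input is sorted afresh on every call.
    ordered = sorted(items, key=get_age)
    return {age: list(group) for age, group in itertools.groupby(ordered, key=get_age)}


if __name__ == "__main__":
    for func in (group_with_defaultdict, group_with_groupby):
        # Take the best of several repeats to reduce noise from other processes.
        best = min(timeit.repeat(lambda: func(people), number=200, repeat=5))
        print(f"{func.__name__}: {best:.4f} seconds for 200 calls")
```

Absolute timings will depend on the machine and Python version, so only the relative ordering is worth comparing.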
Upvotes: 2
Reputation: 391846
defaultdict is how this is done.
While for loops are largely essential, if statements aren't.
from collections import defaultdict
groups = defaultdict(list)
for obj in old_list:
groups[obj.some_attr].append(obj)
new_list = list(groups.values())  # list() is needed on Python 3, where values() returns a view
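As a concrete illustration, here is a minimal sketch of the same pattern applied to a list of tuples, the alternative the question mentions. The sample data and the names records and key_index are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data: tuples of equal length, grouped by their first field.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
key_index = 0

groups = defaultdict(list)
for rec in records:
    groups[rec[key_index]].append(rec)

# Dicts keep insertion order (Python 3.7+), so groups come out in order of first occurrence.
new_list = list(groups.values())
print(new_list)
# [[('a', 1), ('a', 3)], [('b', 2), ('b', 5)], [('c', 4)]]
```

This needs no sorting, and the grouping keys only have to be hashable.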
Upvotes: 104
Reputation: 21089
Here are two cases. Both require the following imports:
import itertools
import operator
You'll be using itertools.groupby and either operator.attrgetter or operator.itemgetter.
For a situation where you're grouping by obj_1.some_attr == obj_2.some_attr:
get_attr = operator.attrgetter('some_attr')
new_list = [list(g) for k, g in itertools.groupby(sorted(old_list, key=get_attr), get_attr)]
For a[some_index] == b[some_index]:
get_item = operator.itemgetter(some_index)
new_list = [list(g) for k, g in itertools.groupby(sorted(old_list, key=get_item), get_item)]
Note that you need the sorting because itertools.groupby starts a new group every time the value of the key changes, so items with equal keys must be adjacent.
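A small sketch of what happens with and without the sort; the pairs data and the get_label name are invented for the illustration.

```python
import itertools
import operator

# Hypothetical sample data: (label, value) pairs, deliberately unsorted.
pairs = [("x", 1), ("y", 2), ("x", 3)]
get_label = operator.itemgetter(0)

# Without sorting, the two "x" items are not adjacent and end up in separate groups.
unsorted_groups = [list(g) for _, g in itertools.groupby(pairs, get_label)]

# sorted() is stable, so items keep their original relative order within each group.
sorted_groups = [list(g) for _, g in itertools.groupby(sorted(pairs, key=get_label), get_label)]

print(unsorted_groups)  # [[('x', 1)], [('y', 2)], [('x', 3)]]
print(sorted_groups)    # [[('x', 1), ('x', 3)], [('y', 2)]]
```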
Note that you can use this to create a dict like S.Lott's answer, but don't have to use collections.defaultdict.
Using a dictionary comprehension (works with Python 2.7 and Python 3+). Each group must be copied into a list, because groupby invalidates a group's iterator as soon as it moves on to the next key:
groupdict = {k: list(g) for k, g in itertools.groupby(sorted_list, keyfunction)}
For earlier versions of Python, pass a generator of (key, list) pairs to dict instead:
groupdict = dict((k, list(g)) for k, g in itertools.groupby(sorted_list, keyfunction))
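One thing to watch when building a dict from groupby: the groups are lazy iterators that share the underlying iterable, so a group stored without copying is empty by the time you read it. The sketch below shows this on CPython 3; the sample data is invented, and sorted_list and keyfunction are given concrete stand-in values.

```python
import itertools
import operator

# Hypothetical sample data, already sorted by the grouping key.
sorted_list = [("a", 1), ("a", 2), ("b", 3)]
keyfunction = operator.itemgetter(0)

# Storing the group iterators directly: each one is invalidated once groupby moves on.
lazy = dict(itertools.groupby(sorted_list, keyfunction))
print(list(lazy["a"]))  # []  (already exhausted)

# Copying each group into a list while it is still current keeps the items.
groupdict = {k: list(g) for k, g in itertools.groupby(sorted_list, keyfunction)}
print(groupdict)  # {'a': [('a', 1), ('a', 2)], 'b': [('b', 3)]}
```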
Upvotes: 42
Reputation: 29103
I think you can also use itertools.groupby. Please note that the code below is just a sample and should be modified according to your needs:
data = [[1,2,3],[3,2,3],[1,1,1],[7,8,9],[7,7,9]]
from itertools import groupby
# For example, to group the data by each row's third element:
res = [list(v) for _, v in groupby(sorted(data, key=lambda x: x[2]), key=lambda x: x[2])]
Upvotes: 16