Reputation: 10709
I have a large number of two-element sub-lists inside a list called mylist:
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 34439],
          ['XC013', 86701]]
I want to group the sub-lists of mylist into new lists according to their first item, so that sub-lists sharing the same string end up together. For example, this is what I am looking for my code to output:
newlist = [[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
           [['XC013', 99834], ['XC013', 86701]],
           [['VD126', 18884]]]
Here is how I was trying to code this:
mylist = sorted(mylist)
newlist = []
for sublist in mylist:
    id = sublist[0]
    if id == next.id:
        newlist.append(id)
print newlist
I was also trying to understand whether itertools.groupby() is the correct tool here. Can someone help me with this problem?
Upvotes: 4
Views: 454
Reputation: 62513
Use .get to determine whether a key exists; it returns some specified value, None in this case, if the key is nonexistent. dict.get defaults to None, so this method never raises a KeyError. If None is a legitimate value in the dictionary, change the default value returned by .get, e.g. test.get(t[0], 'something here'). Create the key, and then add the list t as the dict value.

test = dict()
for t in mylist:
    if test.get(t[0]) is None:
        test[t[0]] = [t]
    else:
        test[t[0]].append(t)
final = list(test.values())
print(final) results in:
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
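As a side illustration of the .get semantics this relies on (a minimal sketch; the test dict mirrors the one built above):

test = {'AB001': [['AB001', 22100]]}
print(test.get('AB001'))         # [['AB001', 22100]] -- key exists
print(test.get('XC013'))         # None -- missing key, implicit default
print(test.get('XC013', 'n/a'))  # 'n/a' -- missing key, explicit default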
Upvotes: 0
Reputation: 26916
There are a number of alternatives to solve this problem:
def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result

import collections

def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with other solutions

def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result

import itertools

def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result

def group_by(seq, key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]

def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))
These can be divided into two categories:

- those based on a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
- those based on a uniq-like function, e.g. itertools.groupby() (regroup_by_it(), regroup_by_gb())

The first class of approaches has O(n) computational complexity, while the second class has O(n log n), driven by the initial sorting step.
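As a quick sanity check (a minimal sketch, assuming the five functions above are in scope), all variants should produce the same grouping for the same input and key:

from operator import itemgetter

data = [['AB001', 1], ['XC013', 2], ['AB001', 3]]
variants = [regroup_by_di, regroup_by_dd, regroup_by_sd,
            regroup_by_it, regroup_by_gb]
results = [f(data, key=itemgetter(0)) for f in variants]
assert all(r == results[0] for r in results)
print(results[0])
# {'AB001': [['AB001', 1], ['AB001', 3]], 'XC013': [['XC013', 2]]}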
All of the proposed approaches require specifying a key. For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result one should take only list(dict.values()), e.g.:
from operator import itemgetter

mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
The timings come out faster for all dict-based (first class) solutions and slower for all groupby-based (second class) solutions.
Within the dict-based solutions, performance depends slightly on the "collision rate", i.e. how often an incoming item maps to a key that already exists rather than creating a new one.
For higher collision rates regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.
The benchmarks come out as follows:
(More details available here.)
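A minimal timing sketch along those lines (hypothetical workload; it assumes the functions above are defined, and absolute numbers will vary by machine and data):

import random
import timeit
from operator import itemgetter

# high collision rate: many items, few distinct keys
data = [[random.choice(['AB001', 'XC013', 'VD126']), i] for i in range(10000)]

for f in (regroup_by_di, regroup_by_dd, regroup_by_sd,
          regroup_by_it, regroup_by_gb):
    t = timeit.timeit(lambda: f(data, key=itemgetter(0)), number=100)
    print(f'{f.__name__}: {t:.3f}s')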
Upvotes: 2
Reputation: 176940
You were right about this being a job for groupby:
from itertools import groupby
from operator import itemgetter
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print([list(value) for key, value in groupby(sorted(mylist), key=itemgetter(0))])
This will give you a list-of-lists, grouped by the first item in the sublist.
[[['AB001', 4439], ['AB001', 22100], ['AB001', 32935]],
[['VD126', 18884]],
[['XC013', 86701], ['XC013', 99834]]]
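Note that groupby only merges runs of consecutive equal keys, which is why the sorted() call matters. A minimal sketch of what happens without it:

from itertools import groupby
from operator import itemgetter

unsorted = [['AB001', 1], ['XC013', 2], ['AB001', 3]]
print([list(g) for _, g in groupby(unsorted, key=itemgetter(0))])
# [[['AB001', 1]], [['XC013', 2]], [['AB001', 3]]] -- 'AB001' is split in two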
Upvotes: 6
Reputation: 164783
collections.defaultdict

An itertools.groupby solution will incur O(n log n) cost since the input must be sorted first. You can use a defaultdict of lists for a guaranteed O(n) solution:
from collections import defaultdict

dd = defaultdict(list)
for item in mylist:
    dd[item[0]].append(item)

res = list(dd.values())
print(res)
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
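If the groups should also come out ordered by key (as in the groupby answer), sorting just the k distinct keys afterwards costs O(k log k), which stays cheaper than sorting all n items when keys repeat often (a sketch, reusing dd from above):

res_sorted = [dd[k] for k in sorted(dd)]  # groups in ascending key order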
Upvotes: 3