Reputation: 127
I am looking to transform a list into smaller lists of equal values. An example I have is:
["a", "a", "a", "b", "b", "c", "c", "c", "c"]
to
[["a", "a", "a"], ["b", "b"], ["c", "c", "c", "c"]]
What do you think is the most efficient way to do this?
Upvotes: 2
Views: 3551
Reputation: 25809
While I'd personally go with itertools.groupby
as the most convinient way, you've asked for efficiency and this should be considerably faster than any of the itertools
options:
data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
lookup = {} # lookup map
result = []
for element in data:
if element not in lookup:
target = lookup[element] = [element]
result.append(target)
else:
lookup[element].append(element)
print(result)
# [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
If the data is always ordered (i.e. elements don't mix) this can be further optimized without a lookup table and using a list comprehension for maximum performance.
UPDATE - Some clarification on efficiency and operation. If you set up your test as:
from itertools import groupby
def itools_func(data):
return [list(grp) for k, grp in groupby(data)]
def manual_func(data):
lookup = {}
result = []
for element in data:
if element not in lookup:
target = lookup[element] = [element]
result.append(target)
else:
lookup[element].append(element)
return result
The problem is that those two will not return the same values:
test_data = ["a", "a", "b", "c", "c", "b", "a"]
itools_func(test_data) # [['a', 'a'], ['b'], ['c', 'c'], ['b'], ['a']]
manual_func(test_data) # [['a', 'a', 'a'], ['b', 'b'], ['c', 'c']]
From OP's question, I understood he wants the latter one (based on his comment "I sorted the list to make values consecutive") because with a sorted list this can be done far easier. So, if we feed those functions a really long list:
test_data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"] * 10000 # 10000 x the original
On my system it clocks as follows:
itools_func - 100 loops: 2.668s, per loop: 26.68ms
manual_func - 100 loops: 1.005s, per loop: 10.05ms
But, this is an unfavorable setting for itertools.groopby
. If the data was to be sorted like:
test_data = ["a"] * 3000 + ["b"] * 2000 + ["c"] * 40000
The story is quite a bit different as the C backend kicks in:
itools_func - 1000 loops: 656.3ms, per loop: 656.3µs
manual_func - 1000 loops: 4.816s, per loop: 4.816ms
When the data is sorted the manual function can be further optimized but it will hardly beat what itertools
does under the hood.
Upvotes: 0
Reputation: 9267
Another manner to have your desired output is by using defaultdict
from collections
module (best time using this approach was: ~= 0.02s same as using groupby
):
from collections import defaultdict
a = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
b = defaultdict(list)
for k in a:
b[k].append(k)
>>> b
defaultdict(list,
{'a': ['a', 'a', 'a'], 'b': ['b', 'b'], 'c': ['c', 'c', 'c', 'c']})
So, what you have to do now is:
list(b.values())
>>> [['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
Upvotes: 2
Reputation: 5560
You could use collections.Counter
>>> lst = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
>>> import collections
>>> collections.Counter(lst).most_common()
[('c', 4), ('a', 3), ('b', 2)]
This works even when the values are not ordered and provides a very compact representation which then you could expand if needed into lists:
>>> [[i]*n for i,n in collections.Counter(lst).most_common()]
[['c', 'c', 'c', 'c'], ['a', 'a', 'a'], ['b', 'b']]
Upvotes: 5
Reputation: 152735
You could use itertools.groupby
to solve the problem:
>>> from itertools import groupby
>>> [list(grp) for k, grp in groupby(["a", "a", "a", "b", "b", "c", "c", "c", "c"])]
[['a', 'a', 'a'], ['b', 'b'], ['c', 'c', 'c', 'c']]
It only groups consecutive equal elements but that seems enough in your case.
Upvotes: 9