Reputation: 1377
Below is my list:
col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']]
My goal is to eliminate items that appear only once in each inner list.
Here is my code:
eliminate = [[w for w in c if c.count(w) > 1] for c in col]
Output: [['red', 'yellow', 'red', 'yellow'], ['pink', 'brown', 'pink', 'brown']]
The code works fine for a small dataset such as the list above; however, my dataset is very large, and each inner list contains up to 1,000 items.
Is there a way to make the above code run faster? For example, by breaking it down into two or more for-loops, since my understanding is that a normal for-loop is faster than a list comprehension.
Any suggestions? Thanks.
Upvotes: 3
Views: 2595
Reputation: 1121884
Don't use .count(), as it scans through your list for each element. Moreover, it'll add items to the output multiple times if they appear 3 or more times in the input.
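A quick illustration of that second point, on a toy list (my example, not from the question):

>>> c = ['a', 'a', 'a', 'b']
>>> [w for w in c if c.count(w) > 1]
['a', 'a', 'a']

All three 'a's pass the count(w) > 1 test, so the duplicate is emitted three times, and each test re-scans the whole list.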
You'd be better off using a generator function here that only yields items it has seen before, but only once:
def unique_plurals(lst):
    seen, seen_twice = set(), set()
    # bind the .add methods to local names to avoid repeated attribute lookups
    seen_add, seen_twice_add = seen.add, seen_twice.add
    for item in lst:
        if item in seen and item not in seen_twice:
            seen_twice_add(item)
            yield item
            continue
        seen_add(item)
[list(unique_plurals(c)) for c in col]
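For the question's col, that yields each duplicated value exactly once, in first-seen order:

>>> [list(unique_plurals(c)) for c in col]
[['red', 'yellow'], ['pink', 'brown']]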
This iterates just once through each list (unlike using a Counter()).
This method is far faster:
>>> timeit('[[k for k, v in OrderedCounter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, OrderedCounter')
52.00807499885559
>>> timeit('[[k for k, v in Counter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, Counter')
15.766052007675171
>>> timeit('[list(unique_plurals(c)) for c in col]', 'from __main__ import col, unique_plurals')
6.946599006652832
>>> timeit('[list(unique_plurals_dict(c)) for c in col]', 'from __main__ import col, unique_plurals_dict')
6.557853937149048
That's about 8 times faster than the OrderedCounter method, and 2.2 times faster than the Counter method.
Jon's one-dictionary-plus-counter method (unique_plurals_dict in the timings above) is faster still, though!
However, if you just needed to eliminate values that appear only once but keep the rest intact including repeats, then you'd use (borrowing from Jon):
from itertools import count
from collections import defaultdict

def nonunique_plurals(lst):
    seen = defaultdict(count)
    for item in lst:
        cnt = next(seen[item])
        if cnt:
            if cnt == 1:
                # yield twice to make up for skipped first item
                yield item
            yield item
This produces:
>>> [list(nonunique_plurals(c)) for c in col]
[['red', 'red', 'yellow', 'yellow'], ['pink', 'pink', 'brown', 'brown']]
>>> timeit('[non_uniques(c) for c in col]', 'from __main__ import col, non_uniques')
17.75499200820923
>>> timeit('[list(nonunique_plurals(c)) for c in col]', 'from __main__ import col, nonunique_plurals')
9.306739091873169
That's almost twice the speed of the Counter() solution proposed by FMc, but it doesn't retain order as precisely:
>>> list(nonunique_plurals(['a', 'a', 'b', 'a', 'b', 'c']))
['a', 'a', 'a', 'b', 'b']
>>> non_uniques(['a', 'a', 'b', 'a', 'b', 'c'])
['a', 'a', 'b', 'a', 'b']
Upvotes: 7
Reputation: 42411
This addresses your revised question: it does make two passes over the inner lists (first to count, then to retrieve) and thus isn't as fast as possible; however, it preserves order and is very readable. As usual, trade-offs abound!
from collections import Counter

cols = [
    ['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
    ['pink', 'orange', 'brown', 'pink', 'brown'],
]

def non_uniques(vals):
    counts = Counter(vals)
    return [v for v in vals if counts[v] > 1]

non_uniqs = map(non_uniques, cols)
# [
#     ['red', 'yellow', 'red', 'yellow'],
#     ['pink', 'brown', 'pink', 'brown'],
# ]
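One side note, not from the original answer: this thread's code is Python 2. Under Python 3, map() returns a lazy iterator rather than a list, so you'd materialise it explicitly:

non_uniqs = list(map(non_uniques, cols))  # or: [non_uniques(c) for c in cols]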
Upvotes: 1
Reputation: 142146
I'd have a go at trying an OrderedCounter to avoid the repeated .count() calls:
from collections import OrderedDict, Counter

col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']]

class OrderedCounter(Counter, OrderedDict):
    pass

new = [[k for k, v in OrderedCounter(el).iteritems() if v != 1] for el in col]
# [['red', 'yellow'], ['pink', 'brown']]
And if we just wish to iterate once, then (similar to Martijn's - plus playing less with sets):
from itertools import count

def unique_plurals(iterable):
    # seen maps each element to its own count() iterator; an element is kept
    # exactly when its counter reaches 1, i.e. on its second occurrence
    seen = {}
    return [el for el in iterable if next(seen.setdefault(el, count())) == 1]

new = map(unique_plurals, col)
This is more flexible in terms of specifying how many occurrences are required, and it keeps one dict instead of multiple sets.
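As a minimal sketch of that flexibility (my own illustration, not from the answer; n_plurals is a hypothetical name), the occurrence threshold can be made a parameter:

from itertools import count

def n_plurals(iterable, n=2):
    # hypothetical generalisation: emit each element on its n-th occurrence;
    # n=2 reproduces unique_plurals above
    seen = {}
    return [el for el in iterable if next(seen.setdefault(el, count())) == n - 1]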
Upvotes: 9
Reputation: 49826
my understanding is that a normal for-loop is faster than a list comprehension.
Nope.
Your loop is slow because it duplicates operations. For every string in each nested list in col it performs a count of the number of instances of that string, so for each c in col it performs len(c)**2 comparisons. That's an O(N*M^2) algorithm (N lists of length up to M), and it gets slow quickly.
To make this faster, use better data structures: use collections.Counter.
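To see what that buys you, here's an illustrative session with the question's first list; Counter tallies every element in a single O(M) pass, after which each count lookup is O(1):

>>> from collections import Counter
>>> counts = Counter(['red', 'yellow', 'blue', 'red', 'green', 'yellow'])
>>> counts['red'], counts['blue']
(2, 1)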
Upvotes: 2