Tiger1

Reputation: 1377

How to speed up list comprehension

Below is my list:

col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']
      ]

My goal is to eliminate the items that appear only once in each list.

Here is my code:

eliminate = [[w for w in c if c.count(w)>1]for c in col]

Output: [['red', 'yellow', 'red', 'yellow'], ['pink', 'brown', 'pink', 'brown']]

The code works fine for a small dataset such as the list above; however, my dataset is very large and each list contains up to 1,000 items.

Is there a way to make the above code run faster? Perhaps by breaking it down into two or more for-loops, since my understanding is that a normal for-loop is faster than a list comprehension.

Any suggestions? Thanks.

Upvotes: 3

Views: 2595

Answers (4)

Martijn Pieters

Reputation: 1121884

Don't use .count(), as it scans through the whole list for each element. Moreover, it adds items to the output multiple times if they appear three or more times in the input.
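
For example, with an item that appears three times, every check re-counts the whole list and the item is emitted three times:

>>> c = ['a', 'a', 'a', 'b']
>>> [w for w in c if c.count(w) > 1]
['a', 'a', 'a']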

You'd be better off using a generator function here that yields only items it has seen before, and yields each of them just once:

def unique_plurals(lst):
    seen, seen_twice = set(), set()
    # bind the .add methods once to avoid attribute lookups in the loop
    seen_add, seen_twice_add = seen.add, seen_twice.add
    for item in lst:
        # second sighting of an item that hasn't been yielded yet
        if item in seen and item not in seen_twice:
            seen_twice_add(item)
            yield item
            continue
        seen_add(item)

[list(unique_plurals(c)) for c in col]
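
For the sample col this produces:

[['red', 'yellow'], ['pink', 'brown']]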

This iterates just once through each list, unlike the Counter() approaches, which need a second pass over the tallied counts.

This method is far faster:

>>> timeit('[[k for k, v in OrderedCounter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, OrderedCounter')
52.00807499885559
>>> timeit('[[k for k, v in Counter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, Counter')
15.766052007675171
>>> timeit('[list(unique_plurals(c)) for c in col]', 'from __main__ import col, unique_plurals')
6.946599006652832
>>> timeit('[list(unique_plurals_dict(c)) for c in col]', 'from __main__ import col, unique_plurals_dict')
6.557853937149048

That's roughly 7.5 times as fast as the OrderedCounter method and 2.3 times as fast as the plain Counter method.

Jon's one-dictionary-plus-counter method is faster still, though!

However, if you just needed to eliminate values that appear only once but keep the rest intact, including repeats, then you'd use (borrowing from Jon):

from itertools import count
from collections import defaultdict

def nonunique_plurals(lst):
    # each value gets its own itertools.count() tally
    seen = defaultdict(count)
    for item in lst:
        cnt = next(seen[item])  # 0 on first sighting, 1 on second, ...
        if cnt:
            if cnt == 1:
                # yield twice to make up for the skipped first occurrence
                yield item
            yield item

This produces:

>>> [list(nonunique_plurals(c)) for c in col]
[['red', 'red', 'yellow', 'yellow'], ['pink', 'pink', 'brown', 'brown']]
>>> timeit('[non_uniques(c) for c in col]', 'from __main__ import col, non_uniques')
17.75499200820923
>>> timeit('[list(nonunique_plurals(c)) for c in col]', 'from __main__ import col, nonunique_plurals')
9.306739091873169

That's almost twice the speed of the Counter() solution proposed by FMc, but it doesn't retain order as precisely, because the skipped first occurrence is only emitted once the second one is seen:

>>> list(nonunique_plurals(['a', 'a', 'b', 'a', 'b', 'c']))
['a', 'a', 'a', 'b', 'b']
>>> non_uniques(['a', 'a', 'b', 'a', 'b', 'c'])
['a', 'a', 'b', 'a', 'b']

Upvotes: 7

FMc

Reputation: 42411

This addresses your revised question: it does make two passes over the inner lists (first to count, then to retrieve) and thus isn't as fast as possible; however, it preserves order and is very readable. As usual, trade-offs abound!

from collections import Counter

cols = [
    ['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
    ['pink', 'orange', 'brown', 'pink', 'brown'],
]

def non_uniques(vals):
    counts = Counter(vals)
    return [v for v in vals if counts[v] > 1]

non_uniqs = map(non_uniques, cols)

# [
#    ['red', 'yellow', 'red', 'yellow'],
#    ['pink', 'brown', 'pink', 'brown'],
# ]
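
Note that this thread predates Python 3, where map() returns a lazy iterator; there you would materialize the result instead, for example:

non_uniqs = [non_uniques(vals) for vals in cols]  # or list(map(non_uniques, cols))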

Upvotes: 1

Jon Clements

Reputation: 142146

I'd have a go at trying an OrderedCounter to avoid the repeated .count() calls:

from collections import OrderedDict, Counter

col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']]

class OrderedCounter(Counter, OrderedDict):
    pass

new = [[k for k, v in OrderedCounter(el).iteritems() if v != 1] for el in col]
# [['red', 'yellow'], ['pink', 'brown']]

And if we just wish to iterate once, then (similar to Martijn's, but playing less with sets):

from itertools import count
def unique_plurals(iterable):
    seen = {}  # maps each element to its own count() tally
    # next() yields 0 on the first sighting, 1 on the second, and so on
    return [el for el in iterable if next(seen.setdefault(el, count())) == 1]

new = map(unique_plurals, col)

This is more flexible in terms of specifying how many occurrences are required, and keeps one dict instead of multiple sets; a sketch of that flexibility follows.
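
A minimal sketch of that parametrization, where the n argument (meaning "emit the element on its n-th occurrence") is an illustrative addition rather than part of the answer above:

from itertools import count

def plurals(iterable, n=2):
    # n is hypothetical; n=2 reproduces unique_plurals' behaviour
    seen = {}
    return [el for el in iterable if next(seen.setdefault(el, count())) == n - 1]

>>> plurals(['a', 'a', 'b', 'a'], n=3)
['a']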

Upvotes: 9

Marcin

Reputation: 49826

my understanding is that a normal for-loop is faster than a list comprehension.

Nope.

Your loop is slow because it duplicates work. For every string in each nested list in col, it counts the occurrences of that string from scratch, so for each c in col it performs on the order of len(c)**2 comparisons. With N lists of length M, that's an O(N*M^2) algorithm, and it gets slow quickly.

To make this faster, use better data structures: use collections.Counter.
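
A minimal sketch of what that looks like, along the lines of FMc's answer (the counts name is chosen here for illustration):

from collections import Counter

def non_unique(c):
    counts = Counter(c)                      # one O(M) pass to tally
    return [w for w in c if counts[w] > 1]   # one O(M) pass to filter

>>> [non_unique(c) for c in col]
[['red', 'yellow', 'red', 'yellow'], ['pink', 'brown', 'pink', 'brown']]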

Upvotes: 2
