PhD

Reputation: 11334

Why is converting a list to a set faster than using just list to compute a list difference?

Say, I wish to compute the difference of two lists C = A - B:

A = [1,2,3,4,5,6,7,8,9] 
B = [1,3,5,8,9]
C = [2,4,6,7]          #Result

A and B are both sorted lists of unique integers (I'm not sure if there is a way to tell Python about this property of the lists). I need to preserve the order of the elements. AFAIK there are two possible ways of doing it:

Method 1: Convert B into a set and use list comprehension to generate C:

s = set(B)
C = [x for x in A if x not in s]

Method 2: Directly use list comprehension:

C = [x for x in A if x not in B]

Why is #1 more efficient than #2? Isn't there an overhead to convert to a set? What am I missing here?

Some performance benchmarks are given in this answer.

UPDATE: I'm aware that a set's average O(1) lookup time beats a list's O(n), but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?
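
For reference, here is a minimal timeit sketch one could use to measure both methods (the sizes and names are arbitrary, just for illustration):

import random
import timeit

n = 10_000                    # illustrative size
A = list(range(n))
B = random.sample(A, n // 2)  # some subset to subtract

def method1():
    s = set(B)
    return [x for x in A if x not in s]

def method2():
    return [x for x in A if x not in B]

print(timeit.timeit(method1, number=10))  # set-based
print(timeit.timeit(method2, number=10))  # list-based, much slower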

Upvotes: 15

Views: 12669

Answers (3)

TheSoundDefense

Reputation: 6935

There is overhead to convert a list to a set, but a set is substantially faster than a list for membership tests with the in operator.

You can see almost instantly whether item x is in set y, because there's a hash table being used underneath. No matter how large your set is, the lookup time stays (on average) the same - this is known in Big-O notation as O(1). For a list, you have to individually check every element to see if item x is in list y. As your list grows, the check will take longer - this is O(n), meaning the length of the operation is directly tied to how long the list is.

That increased speed can offset the set creation overhead, which is how your set check ends up being faster.
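
A rough way to see this for yourself (the sizes and repeat counts below are arbitrary, just for illustration):

import timeit

for n in (1_000, 100_000):
    lst = list(range(n))
    st = set(lst)
    probe = n - 1  # worst case for the list: the last element
    t_list = timeit.timeit(lambda: probe in lst, number=1_000)
    t_set = timeit.timeit(lambda: probe in st, number=1_000)
    print(f"n={n}: list {t_list:.4f}s, set {t_set:.4f}s")

The list timing grows roughly with n, while the set timing stays flat.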

EDIT: to answer that other question, Python has no way of determining that your list is sorted - not if you're using a standard list object, anyway. So the membership test inside the list comprehension can't achieve O(log n) performance. If you wanted to write your own binary search method that assumes the list is sorted, you can certainly do so, but O(1) beats O(log n) any day.
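
If you did want to exploit the sorted property, the standard library's bisect module makes an O(log n) membership test easy. A sketch (contains_sorted is just a name I made up):

from bisect import bisect_left

def contains_sorted(sorted_list, x):
    # O(log n) membership test on an already-sorted list
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [1, 3, 5, 8, 9]
C = [x for x in A if not contains_sorted(B, x)]
print(C)  # [2, 4, 6, 7]

Even so, the set-based version wins asymptotically: O(n) total versus O(n log n).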

EDIT 2:

I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?

No, not at all. Creating a set out of a list is an O(n) operation, as inserting an item into a set is O(1) and you're doing that n times. If you have a list with a million integers in it, the set approach involves two O(n) steps (building the set, then doing the n lookups), while repeatedly scanning the list amounts to n O(n) steps. As a back-of-envelope estimate, the set approach ends up about 250,000 times faster for a list with a million integers, and the speed difference will grow larger and larger the more items you have in your list.
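
The arithmetic behind that estimate (assuming each list scan touches about half the elements on average):

n = 1_000_000
set_ops = 2 * n            # build the set (n inserts) + n O(1) lookups
list_ops = n * (n // 2)    # n lookups, each scanning ~n/2 elements on average
print(list_ops / set_ops)  # 250000.0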

Upvotes: 18

user3881791

Reputation:

According to the Python wiki's page on time complexity:

  • List membership x in s is on average a linear-time operation, O(n).
  • Set membership x in s is on average a constant-time operation, O(1).

Building a set is a worst-case linear-time operation, because one needs to scan all the elements of the list to build the hash table, so O(n), where n is the number of elements in the collection.

The key observation is that, in Method 1, building the set, s = set(B), is a one-off operation; after that we have n set-membership tests of the form x not in s, so in total O(n) + n * O(1), or O(n) time complexity.

Whereas in Method 2, the list-membership test x not in B is carried out for each element in A, so in total n * O(n) = O(n^2) time complexity.
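
One way to make those counts concrete is to count equality comparisons directly. A toy sketch (CountedInt is an illustrative helper, not from the question):

class CountedInt(int):
    # toy int subclass that counts equality comparisons
    comparisons = 0
    def __eq__(self, other):
        CountedInt.comparisons += 1
        return int(self) == int(other)
    __hash__ = int.__hash__  # defining __eq__ would otherwise disable hashing

A = [CountedInt(i) for i in range(1, 10)]
B = [CountedInt(i) for i in (1, 3, 5, 8, 9)]

CountedInt.comparisons = 0
s = set(B)
C = [x for x in A if x not in s]
print(CountedInt.comparisons)  # a handful: a hash lookup rarely needs more than one comparison

CountedInt.comparisons = 0
C = [x for x in A if x not in B]
print(CountedInt.comparisons)  # many more: each test scans the list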

Upvotes: 9

Pankaj Sharma

Reputation: 679

Average time complexity for a lookup (x in S) in a set is O(1), while the same for a list is O(n).

You can check the details at https://wiki.python.org/moin/TimeComplexity

Upvotes: 8
