Reputation: 35760
I would like to loop through a big two dimension list:
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"], ... ]
and get a list that contains all the names that occurs in authors.
When I loop through the list, I need a container to store names I've already seen, I'm wondering if I should use a list or a dict:
with a list:
seen = []
for author_list in authors:
for author in author_list:
if not author in seen:
seen.append(author)
result = seen
with a dict:
seen = {}
for author_list in authors:
for author in author_list:
if not author in seen:
seen[author] = True
result = seen.keys()
which one is faster? or is there better solutions?
Upvotes: 0
Views: 2583
Reputation: 12486
You really want a set
. Sets are faster than lists because they can only contain unique elements, which allows them to be implemented as hash tables. Hash tables allow membership testing (if element in my_set
) in O(1)
time. This contrasts with lists, where the only way to check if an element is in the list is to check every element of the list in turn (in O(n)
time.)
A dict
is similar to a set
in that both allow unique keys only, and both are implemented as hash tables. They both allow O(1)
membership testing. The difference is that a set
only has keys, while a dict
has both keys and values (which is extra overhead you don't need in this application.)
Using a set
, and replacing the nested for loop with an itertools.chain()
to flatten the 2D list to a 1D list:
import itertools
seen = set()
for author in itertools.chain(*authors):
seen.add(author)
Or shorter:
import itertools
seen = set( itertools.chain(*authors) )
Edit (thanks, @jamylak) more memory efficient for large lists:
import itertools
seen = set( itertools.chain.from_iterable(authors) )
Example on a list of lists:
>>> a = [[1,2],[1,2],[1,2],[3,4]]
>>> set ( itertools.chain(*a) )
set([1, 2, 3, 4])
P.S. : If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter
, a special kind of dictionary optimised for counting things.
Here's an example of counting characters in a string:
>>> a = "DEADBEEF CAFEBABE"
>>> import collections
>>> collections.Counter(a)
Counter({'E': 5, 'A': 3, 'B': 3, 'D': 2, 'F': 2, ' ': 1, 'C': 1})
Upvotes: 8
Reputation: 71610
Lists just store a bunch of items in a particular order. Think of your list of authors as a long line of pigeonhole boxes with author's names on bits of papers in the boxes. The names stay in the order you put them in, and you can find the author in any particular pigeonhole very easily, but if you want to know if a particular author is in any pigeonhole, then you have to look through each one until you find the name you're after. You can also have the same name in any number of pigeonholes.
Dictionaries are a bit more like a phone book. Given the author's name, you can very quickly check to see whether the author is listed in the phone book, and find the phone number listed with it. But you can only include each author once (with exactly one phone number), and you can't put the authors in there in any order you like, they have to be in the order that makes sense for the phone book. In a real phone book, that order is alphabetical; in Python dictionaries the order is completely unpredictable (and it changes when you add or remove things to the dictionary), but Python can find entries even faster in a dictionary than it could in a phone book.
Sets, on the other hand, are like phone books that just list names, not phone numbers. You still can't have the same name included more than once, it's either in the set or not. And you still can't use the order in which names are in the set for anything useful. But you can very quickly check whether a name is in the set.
Given your use case, a set would appear to be the obvious choice. You don't care about the order in which you've seen authors, or how many times you've seen each author, only that you can quickly check whether you've seen a particular author before.
Lists are bad for this case; they go to the effort of remembering duplicates in whatever order you specify, and they're slow to search. But you also don't have any need to map keys to values, which is what a dictionary does. To go back to the phone book analogy, you don't have anything equivalent to a "phone number"; in your dictionary example you're doing the equivalent of writing a phone book in which everybody's number is listed as True
, so why bother listing the phone numbers at all?
A set, OTOH, does exactly what you need.
Upvotes: 1
Reputation: 4448
If you care about the performance of lookups, lookups in lists are O(n), while lookups in dictionaries are amortised to O(1).
You can find more information here.
Upvotes: 1
Reputation: 69092
using a dict
or a set
is way faster then using a list
import itertools
result = set(itertools.chain.from_iterable(authors))
Upvotes: 3
Reputation: 18038
set
should be faster.
>>> authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]
>>> from itertools import chain
>>> set(chain(*authors))
set(['Lisa', 'Bob', 'Jim', 'Molly', 'Alice'])
Upvotes: 3
Reputation: 7867
You can use set -
from sets import Set
seen = Set()
for author_list in authors:
for author in author_list:
seen.add(author)
result = seen
This way you are escaping the "if" checking, hence solution would be faster.
Upvotes: 2