Creating a dict of empty sets through a loop, then filling those sets using another loop - more efficient method? (Python)

Question

In this answer, I am creating a dict of empty sets by iterating through a list. Then I iterate through the same list & filling those sets. An MRE:

# imports we need
import time
import numpy as np
np.random.seed(42)

What I am doing

An example list of letters. Note that at least one letter will appear more than once.

letters=[np.random.choice([letter for letter in string.ascii_lowercase]) for _ in range(1000)]

Result:

['w', 'n', 'k', 'o', 'm', 'r', ...

Creating a dict with letters as keys, empty sets as values:

letterdict={letter:set() for letter in letters}

Iterating through the letters list again, each entry in the list with the corresponding letter will be a set where the indices of that letter appears in the letters list:

for index, letter in enumerate(letters):
    letterdict[letter].add(index)

letterdict will look like:

{'w': {0, 12, 62, 67, 69, ...

How long it was

This process took:

start = time.time()
letterdict={letter:set() for letter in letters}
for index, letter in enumerate(letters):
    letterdict[letter].add(index)
end = time.time()
print(end-start)

0.000538... sec.

The question

Is there a way to make the creation of letterdict quicker? Afterall, I am iterating through letters twice.

My thoughts: If I could make it in one loop, when it encounters a letter for the first time, I could create a set, and put the index of the letter in it. When encountering the letter for the second time, it could not reset the set, just add the index. However, checking that a letter is already encountered or not seems tedious (ie slow).

In the MRE, assume, that we do not know what all the letters are, so replacing the first loop by {letter:set() for letter in string.ascii_lowercase} is not really useful.

gilch · Accepted Answer

You probably want a collections.defaultdict.

This will create the empty set on lookup if the key is not already present:

from collections import defaultdict

letterdict = defaultdict(set)
for index, letter in enumerate(letters):
     letterdict[letter].add(index)

This way you don't have to initialize the dictionary with empty sets.

You could instead just use the .setdefault() method on a normal dict.

letterdict = {}
for index, letter in enumerate(letters):
     letterdict.setdefault(letter, set()).add(index)

This has the overhead of creating a new set object each time whether it's needed or not, but the builtin set() types seem to construct pretty quickly. It wasn't any slower on the small samples I used to time it. CPython might pool these somehow when they get deleted quickly, like it does for tuples.

Creating a dict of empty sets through a loop, then filling those sets using another loop - more efficient method? (Python)

What I am doing

How long it was

The question

Answers (2)

Related Questions