Reputation: 1291

Difference between removing duplicates from a list using dict and set?

According to my research there are two easy ways to remove duplicates from a list:

a = list(dict.fromkeys(a))

and

a = list(set(a))

Is one of them more efficient than the other?

Upvotes: 4

Answers (4)

JL Meunier

Reputation: 481

CAUTION: since Python 3.7, a dict is guaranteed to keep its keys in insertion order.

So the first form that uses

list(dict.fromkeys(a)) # preserves order!!

preserves the order, while using a set will potentially (and probably) change the order of the elements of the list a.

Upvotes: 3

ze ro

Reputation: 35

Suppose we have the list a = [1,16,2,3,4,5,6,8,10,3,9,15,7].

If we use a = list(set(a)), set() drops the duplicates but does not keep the original order. In CPython the new list typically comes out as [1,2,3,4,5,6,7,8,9,10,15,16], although that ordering is an implementation detail and not guaranteed. If we use a = list(dict.fromkeys(a)) instead, dict.fromkeys() drops the duplicates and keeps the remaining elements in their original order: [1,16,2,3,4,5,6,8,10,9,15,7].

To sum up: if you want to drop duplicates from a list and don't care about the order, set() is what you're looking for. If the order of the list must be kept, use dict.fromkeys().
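As a minimal sketch of the comparison above, the snippet below runs both approaches on the same list. The variable names `ordered` and `unordered` are only illustrative. No expected output is given for the set version, because its iteration order should not be relied on.

```python
a = [1, 16, 2, 3, 4, 5, 6, 8, 10, 3, 9, 15, 7]

# dict keys keep insertion order (guaranteed since Python 3.7),
# so the first occurrence of each element stays where it was.
ordered = list(dict.fromkeys(a))
print(ordered)  # [1, 16, 2, 3, 4, 5, 6, 8, 10, 9, 15, 7]

# A set removes the same duplicates, but its iteration order is an
# implementation detail, so the resulting list order is not guaranteed.
unordered = list(set(a))
print(unordered)

# Both results contain exactly the same elements.
print(sorted(unordered) == sorted(ordered))  # True
```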

Upvotes: 3

Łukasz Obłąk

Reputation: 47

The second approach is the better one, not only because it's faster but also because it shows the programmer's intention more clearly. set() is designed specifically to model mathematical sets, in which elements cannot be duplicated, so it fits this purpose and the intention is clear to the reader. dict(), on the other hand, is for storing key-value pairs and says nothing about the intention.

Upvotes: 2

Tom Wojcik

Reputation: 6179

The second one is more efficient, since sets are more or less made for this purpose and you skip the overhead of creating a dict, which is heavier. Performance-wise, though, it depends on what the payload actually is.

import timeit
import random

input_data = [random.choice(range(100)) for i in range(1000)]

from_keys = timeit.timeit('list(dict.fromkeys(input_data))', number=10000, globals={'input_data': input_data})
from_set = timeit.timeit('list(set(input_data))', number=10000, globals={'input_data': input_data})

print(f"From keys performance: {from_keys:.3f}")
print(f"From set performance: {from_set:.3f}")

Prints:

From keys performance: 0.230
From set performance: 0.140

That doesn't really mean set is almost twice as fast in general. These numbers are totals for 10000 runs, so the difference per call is barely noticeable. Try it for yourself with different random data.
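One way to try different data is sketched below. The helper `compare` and the three payloads are hypothetical choices made for illustration, not part of the benchmark above, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import random
import timeit


def compare(data, number=200):
    """Time both approaches on the same list.

    Returns (dict_seconds, set_seconds), each a total over `number` runs.
    """
    env = {"data": data}
    t_dict = timeit.timeit("list(dict.fromkeys(data))", number=number, globals=env)
    t_set = timeit.timeit("list(set(data))", number=number, globals=env)
    return t_dict, t_set


random.seed(0)  # fixed seed so the payloads are reproducible

# Illustrative payloads (assumptions): vary duplicate ratio and element type.
payloads = {
    "few distinct ints": [random.randrange(100) for _ in range(1000)],
    "many distinct ints": [random.randrange(10**6) for _ in range(1000)],
    "short strings": [str(random.randrange(100)) for _ in range(1000)],
}

for name, data in payloads.items():
    t_dict, t_set = compare(data)
    print(f"{name}: dict.fromkeys {t_dict:.4f}s, set {t_set:.4f}s")
```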

Upvotes: 3
