Lucky

Reputation: 711

de-dup list of tuples while preserving order of initial appearance

I have a list of tuples which may contain duplicates. I need to remove the duplicates in the strict sense, i.e., every occurrence of a tuple after its initial appearance. However, the tricks I found online using set-based comprehensions don't work for me, because I need to preserve the order:

>>> x = [(4,0),(4,0),(1,3) ...]
>>> x = [t for t in (set(tuple(i) for i in x))]
>>> print(x)
[(1,3), (4,0)] # bad, order was trashed!

I tried something like this:

>>> d = {}
>>> newlist = []
>>> for t in x:
...     if t not in d:
...             newlist.append(t)
...             d[t] = 1
...
# then newlist holds one instance of every tuple in x, in order.

I don't think I have the details right, but the idea is: record a key in a dictionary the first time a tuple from x is encountered, and only append tuples that haven't been seen before. Is this possible? Is there a more Pythonic way?

Upvotes: 1

Views: 26

Answers (1)

Daniil Fajnberg

Reputation: 18598

Aside from the dict.fromkeys method suggested by @andrej-kesely, I can think of three other approaches to this problem.

I know you did not ask for this, but for fun I wrote a little test script to see how the various correct approaches compare in terms of execution time. It should work under Python 3.7 and above, where dicts are guaranteed to preserve insertion order:

import sys
from random import randint
from timeit import timeit
from typing import List, Tuple


TuplesT = List[Tuple[int, ...]]
TUP_LEN = 2
MIN, MAX = 0, 9


def get_tuples(n: int) -> TuplesT:
    """Randomly generates a list of `n` tuples."""
    return [
        tuple(randint(MIN, MAX) for __ in range(TUP_LEN))
        for _ in range(n)
    ]


def new_list_contains(tuples: TuplesT) -> TuplesT:
    new = []
    for tup in tuples:
        if tup not in new:
            new.append(tup)
    return new


def helper_set_lookup(tuples: TuplesT) -> TuplesT:
    new = []
    already_seen = set()
    for tup in tuples:
        if tup not in already_seen:
            already_seen.add(tup)
            new.append(tup)
    return new


def dict_from_keys(tuples: TuplesT) -> TuplesT:
    return list(dict.fromkeys(tuples))


def in_place_with_set_lookup(tuples: TuplesT) -> TuplesT:
    to_delete = []
    already_seen = set()
    for i in range(len(tuples)):
        if tuples[i] in already_seen:
            to_delete.append(i)
        else:
            already_seen.add(tuples[i])
    for i in reversed(to_delete):
        del tuples[i]
    return tuples  # return it as well to make function comparisons easier


FUNCTIONS = (
    new_list_contains,  # reference implementation
    helper_set_lookup,
    dict_from_keys,
    in_place_with_set_lookup,
)


def test_correctness(tuples: TuplesT) -> None:
    correct_list = new_list_contains(tuples)
    for func in FUNCTIONS[1:]:
        assert correct_list == func(tuples.copy()), f"{func.__name__} incorrect!"


def main(n: int) -> None:
    test_correctness(test_tuples)
    for func in FUNCTIONS:
        name = func.__name__
        t = timeit(
            # Copy per call: the in-place variant mutates its argument, so
            # without a fresh copy every run after the first would operate
            # on an already de-duplicated list. The copy cost is incurred
            # equally by all functions.
            stmt=f"{name}(tuples_copy.copy())",
            setup=f"from __main__ import test_tuples, {name};"
                  f"tuples_copy = test_tuples.copy()",
            number=n,
        )
        print(f"{name:<30} {round(t, 4)} seconds")


if __name__ == '__main__':
    num_tuples, num_repeats = sys.argv[1:3]
    test_tuples = get_tuples(int(num_tuples))
    main(int(num_repeats))

The script needs to be called with two arguments, the first being the number of test tuples to randomly generate in advance and the second being the number of repetitions to use for each function.

Here are some non-scientific results from my environment.

num_tuples, num_repeats = 100, 100000:

new_list_contains              3.2087 seconds
helper_set_lookup              0.6152 seconds
dict_from_keys                 0.3408 seconds
in_place_with_set_lookup       0.4801 seconds

num_tuples, num_repeats = 1000, 10000:

new_list_contains              5.0503 seconds
helper_set_lookup              0.3682 seconds
dict_from_keys                 0.3088 seconds
in_place_with_set_lookup       0.0792 seconds

num_tuples, num_repeats = 100000, 100:

new_list_contains              5.5827 seconds
helper_set_lookup              0.3122 seconds
dict_from_keys                 0.3023 seconds
in_place_with_set_lookup       0.0114 seconds

The dict.fromkeys method suggested by @andrej-kesely is pretty fast, but an in-place implementation using a set as a cache appears to be significantly faster still, and the gap widens as the number of tuples in the list grows. (Caveat: the in-place function mutates its input, so a fair timing must hand each call a fresh copy; otherwise every run after the first operates on an already de-duplicated list.)

Additionally, I would argue that the "in-place" approach is more memory efficient, because the cache set and index list likely use less memory than a whole additional dictionary and a whole additional list.
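Whether that memory advantage holds depends on the data: the index list grows with the number of duplicates, while the dict and result list grow with the number of distinct tuples. A rough sketch of the comparison, using a made-up test list (note that sys.getsizeof only measures the container overhead, not the tuples themselves):

```python
import sys

# Hypothetical test data: 10_000 tuples drawn from 700 distinct values,
# so most entries are duplicates.
tuples = [(i % 100, i % 7) for i in range(10_000)]

# Containers allocated by the dict.fromkeys approach: a dict plus a result list.
d = dict.fromkeys(tuples)
fromkeys_bytes = sys.getsizeof(d) + sys.getsizeof(list(d))

# Containers allocated by the in-place approach: a cache set and an index list.
seen, to_delete = set(), []
for i, tup in enumerate(tuples):
    if tup in seen:
        to_delete.append(i)
    else:
        seen.add(tup)
inplace_bytes = sys.getsizeof(seen) + sys.getsizeof(to_delete)

print(fromkeys_bytes, inplace_bytes)
```

With this duplicate-heavy input the index list holds 9300 entries, so the in-place bookkeeping is not automatically the smaller one; with few duplicates the picture reverses.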

If your list of tuples isn't very large, I would still definitely recommend the approach via dictionary keys because "readability counts" and it is just hilariously more concise and elegant than any other method that comes to mind.
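For completeness, here is the dict.fromkeys approach applied to a made-up list in the spirit of the one from the question:

```python
# dict.fromkeys keeps the first occurrence of each key, and dicts preserve
# insertion order (guaranteed since Python 3.7).
x = [(4, 0), (4, 0), (1, 3), (4, 0), (1, 3), (2, 5)]
deduped = list(dict.fromkeys(x))
print(deduped)  # [(4, 0), (1, 3), (2, 5)]
```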

Upvotes: 1
