jihn04
jihn04

Reputation: 21

Find duplicates in a list of tuples

You are given information about users of your website. The information includes username, a phone number and/or an email. Write a program that takes in a list of tuples where each tuple represents information for a particular user and returns a list of lists where each sublist contains the indices of tuples containing information about the same person. For example:

Input:
[("MLGuy42", "[email protected]", "123-4567"),
("CS229DungeonMaster", "123-4567", "[email protected]"),
("Doomguy", "[email protected]", "[email protected]"),
("andrew26", "[email protected]", "[email protected]")]

Output:
[[0, 1, 3], [2]]

Since "MLGuy42", "CS229DungeonMaster" and "andrew26" are all the same person.

Each sublist in the output should be sorted and the outer list should be sorted by the first element in the sublist.


Below is the code snippet that I did for this problem. It seems to work fine, but I'm wondering if there is a better/optimized solution. Any help would be appreciated. Thanks!

def find_duplicates(user_info):
    results = list()
    seen = dict()
    for i, user in enumerate(user_info):
        first_seen = True
        key_info = None
        for info in user:
            if info in seen:
                first_seen = False
                key_info = info
                break
        if first_seen:
            results.append([i])
            pos = len(results) - 1
        else:
            index = seen[key_info]
            results[index].append(i)
            pos = index
        for info in user:
            seen[info] = pos
    return results

Upvotes: 0

Views: 318

Answers (2)

jihn04
jihn04

Reputation: 21

I think I've reached to an optimized working solution using a graph. Basically, I've created a graph with each node contains its user information and its index. Then, use dfs to traverse the graph and find the duplicates.

Upvotes: 1

cdlane
cdlane

Reputation: 41925

I think we can simplify this using sets:

from random import shuffle

def find_duplicates(user_info):

    reduced = unreduced = {frozenset(info): [i] for i, info in enumerate(user_info)}

    while reduced is unreduced or len(unreduced) > len(reduced):

        unreduced = dict(reduced)  # make a copy

        for identifiers_1, positions_1 in unreduced.items():

            for identifiers_2, positions_2 in unreduced.items():

                if identifiers_1 is identifiers_2:
                    continue

                if identifiers_1 & identifiers_2:
                    del reduced[identifiers_1], reduced[identifiers_2]
                    reduced[identifiers_1 | identifiers_2] = positions_1 + positions_2
                    break
            else:  # no break
                continue

            break

    return sorted(sorted(value) for value in reduced.values())

my_input = [ \
    ("CS229DungeonMaster", "123-4567", "[email protected]"), \
    ("Doomguy", "[email protected]", "[email protected]"), \
    ("andrew26", "[email protected]", "[email protected]"), \
    ("MLGuy42", "[email protected]", "123-4567"), \
]

shuffle(my_input)  # shuffle to prove order independence

print(my_input)
print(find_duplicates(my_input))

OUTPUT

> python3 test.py
[('CS229DungeonMaster', '123-4567', '[email protected]'), ('MLGuy42', '[email protected]', '123-4567'), ('andrew26', '[email protected]', '[email protected]'), ('Doomguy', '[email protected]', '[email protected]')]
[[0, 1, 2], [3]]
> 

Upvotes: 0

Related Questions