Reputation: 21
You are given information about users of your website. The information includes username, a phone number and/or an email. Write a program that takes in a list of tuples where each tuple represents information for a particular user and returns a list of lists where each sublist contains the indices of tuples containing information about the same person. For example:
Input:
[("MLGuy42", "[email protected]", "123-4567"),
("CS229DungeonMaster", "123-4567", "[email protected]"),
("Doomguy", "[email protected]", "[email protected]"),
("andrew26", "[email protected]", "[email protected]")]
Output:
[[0, 1, 3], [2]]
Since "MLGuy42", "CS229DungeonMaster" and "andrew26" are all the same person.
Each sublist in the output should be sorted and the outer list should be sorted by the first element in the sublist.
Below is the code snippet that I did for this problem. It seems to work fine, but I'm wondering if there is a better/optimized solution. Any help would be appreciated. Thanks!
def find_duplicates(user_info):
results = list()
seen = dict()
for i, user in enumerate(user_info):
first_seen = True
key_info = None
for info in user:
if info in seen:
first_seen = False
key_info = info
break
if first_seen:
results.append([i])
pos = len(results) - 1
else:
index = seen[key_info]
results[index].append(i)
pos = index
for info in user:
seen[info] = pos
return results
Upvotes: 0
Views: 318
Reputation: 21
I think I've reached to an optimized working solution using a graph. Basically, I've created a graph with each node contains its user information and its index. Then, use dfs to traverse the graph and find the duplicates.
Upvotes: 1
Reputation: 41925
I think we can simplify this using sets:
from random import shuffle
def find_duplicates(user_info):
reduced = unreduced = {frozenset(info): [i] for i, info in enumerate(user_info)}
while reduced is unreduced or len(unreduced) > len(reduced):
unreduced = dict(reduced) # make a copy
for identifiers_1, positions_1 in unreduced.items():
for identifiers_2, positions_2 in unreduced.items():
if identifiers_1 is identifiers_2:
continue
if identifiers_1 & identifiers_2:
del reduced[identifiers_1], reduced[identifiers_2]
reduced[identifiers_1 | identifiers_2] = positions_1 + positions_2
break
else: # no break
continue
break
return sorted(sorted(value) for value in reduced.values())
my_input = [ \
("CS229DungeonMaster", "123-4567", "[email protected]"), \
("Doomguy", "[email protected]", "[email protected]"), \
("andrew26", "[email protected]", "[email protected]"), \
("MLGuy42", "[email protected]", "123-4567"), \
]
shuffle(my_input) # shuffle to prove order independence
print(my_input)
print(find_duplicates(my_input))
OUTPUT
> python3 test.py
[('CS229DungeonMaster', '123-4567', '[email protected]'), ('MLGuy42', '[email protected]', '123-4567'), ('andrew26', '[email protected]', '[email protected]'), ('Doomguy', '[email protected]', '[email protected]')]
[[0, 1, 2], [3]]
>
Upvotes: 0